Abstract

We take an important step forward in making Oblivious RAM (O-RAM) practical. We pro- 
pose an O-RAM construction achieving an amortized overhead of 20 ~ 35X (for an O-RAM 
roughly 1 terabyte in size), about 63 times faster than the best existing scheme. On the theoretic 
front, we propose a fundamentally novel technique for constructing Oblivious RAMs: specifi- 
cally, we partition a bigger O-RAM into smaller O-RAMs, and employ a background eviction 
technique to obliviously evict blocks from the client-side cache into a randomly assigned server- 
side partition. This novel technique is the key to achieving the gains in practical performance. 



1 Introduction 

As cloud computing gains momentum, an increasing amount of data is outsourced to cloud storage, 
and data privacy has become an important concern for many businesses and individuals alike. 
Encryption alone may not suffice for ensuring data privacy, as data access patterns can leak a
considerable amount of information about the data as well. Pinkas et al. gave an example [14]: if
a sequence of data access requests $q_1, q_2, q_3$ is always followed by a stock exchange operation, the
server can gain sensitive information even when the data is encrypted.



Oblivious RAM (or O-RAM) [4, 5, 12], first investigated by Goldreich and Ostrovsky, is a primitive
intended for hiding storage access patterns. The problem was initially studied in the context of
software protection, i.e., hiding a program's memory access patterns to prevent reverse engineering.

With the trend of cloud computing, O-RAM also has important applications in privacy-preserving
storage outsourcing. In this paper, we consider the setting where a client wishes to
store $N$ blocks each of size $B$ bytes at an untrusted server.

The community's interest in O-RAM has recently been rekindled, partly due to its potential high
impact in privacy-preserving storage outsourcing applications. One of the best schemes known to
date is a novel construction recently proposed by Goodrich and Mitzenmacher [6]. Specifically,
let $N$ denote the total storage capacity of the O-RAM in terms of the number of blocks. The
Goodrich-Mitzenmacher construction achieves $O((\log N)^2)$ amortized cost when parametrized with
$O(1)$ client-side storage, or $O(\log N)$ amortized cost when parametrized with $O(N^a)$
client-side storage where $0 < a < 1$. In this context, an amortized cost of $f(N)$ means that each
data request generates $f(N)$ read or write operations on the server on average.

Despite elegant asymptotic guarantees, the practical performance of existing O-RAM constructions
is still unsatisfactory. As shown in Table 1, one of the most practical schemes known to
date is the construction by Goodrich and Mitzenmacher [6] when parametrized with $N^a$ ($0 < a < 1$)
client-side storage. This scheme has more than 1,400X overhead compared to non-oblivious storage
under reasonable parametrization, which is prohibitive in practice. In summary, although it has
been nearly two decades since Oblivious RAM was first invented, it has so far mostly remained a
theoretical concept.
Scheme                       Amortized Cost    Worst-case Cost    Client Storage            Server Storage   Practical Performance
----------------------------------------------------------------------------------------------------------------------------------
Goldreich-Ostrovsky          $O((\log N)^3)$   $\Omega(N)$        $O(1)$                    $O(N \log N)$    > 120,000X
Pinkas-Reinman [14]          $O((\log N)^2)$   $O(N \log N)$      $O(1)$                    $8N$             60,000 ~ 80,000X
Goodrich-Mitzenmacher [6]    $O(\log N)$       $O(N \log N)$      $O(N^a)$ ($0 < a < 1$)    $8N$             > 1,400X
This paper:
Practical, Non-Concurrent    $O(\log N)$                          $cN$ ($c$ very small)     $< 4N + o(N)$    20 ~ 35X
Practical, Concurrent        $O(\log N)$       $O(\log N)$        $cN$ ($c$ very small)     $< 4N + o(N)$    20 ~ 35X
Theoretic, Non-Concurrent    $O((\log N)^2)$   $O(\sqrt{N})$      $O(\sqrt{N})$             $O(N)$
Theoretic, Concurrent        $O((\log N)^2)$   $O((\log N)^2)$    $O(\sqrt{N})$             $O(N)$

Table 1: Our contributions. The practical performance is the number of client-server operations
per O-RAM operation for typical realistic parameters, e.g., when the server stores terabytes of data,
the client has several hundred megabytes to gigabytes of local storage, and $N > 2^{20}$. For our theoretic
constructions, the same asymptotic bounds also work for the more general case where client-side storage is
$N^a$ for some constant $0 < a < 1$.



O-RAM Capacity  # Blocks  Block Size  Client Storage  Server Storage  Client Storage / O-RAM Capacity  Practical Performance
-----------------------------------------------------------------------------------------------------------------------------
64 GB           $2^{20}$  64 KB       204 MB          205 GB          0.297%                           22.5X
256 GB          $2^{22}$  64 KB       415 MB          819 GB          0.151%                           24.1X
1 TB            $2^{24}$  64 KB       858 MB          3.2 TB          0.078%                           25.9X
16 TB           $2^{28}$  64 KB       4.2 GB          51 TB           0.024%                           29.5X
256 TB          $2^{32}$  64 KB       31 GB           819 TB          0.011%                           32.7X
1024 TB         $2^{34}$  64 KB       101 GB          3072 TB         0.009%                           34.4X

Table 2: Suggested parametrizations of our practical construction. The practical performance is the number
of client-server operations per O-RAM operation as measured by our simulation experiments.
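
The block counts in Table 2 follow from dividing the O-RAM capacity by the block size. The short Python sketch below illustrates that arithmetic. It assumes binary units (1 KB = 2^10 bytes, 1 GB = 2^30 bytes, 1 TB = 2^40 bytes), and the helper name `num_blocks` is illustrative rather than taken from the paper.

```python
KB, GB, TB = 2**10, 2**30, 2**40  # assumption: binary units


def num_blocks(capacity_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold the given capacity."""
    return capacity_bytes // block_bytes


block_size = 64 * KB
for capacity in (64 * GB, 256 * GB, 1 * TB, 16 * TB, 256 * TB, 1024 * TB):
    n = num_blocks(capacity, block_size)
    # Every capacity here is a power of two, so bit_length() - 1 is log2(n).
    print(f"{capacity // GB:>8} GB -> N = 2^{n.bit_length() - 1}")
```

Under these assumptions, a 1 TB O-RAM with 64 KB blocks holds 2^24 blocks.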



1.1 Results and Contributions 

Our main goal is to make Oblivious RAM practical for cloud outsourcing applications. 

Practical construction. We propose an Oblivious RAM construction geared towards optimal
practical performance. The practical construction achieves an amortized overhead of 20 ~ 35X
(Tables 1 and 2), about 63 times faster than the best known construction. In addition, this
practical construction also achieves sub-linear $O(\log N)$ worst-case cost, and constant round-trip
latency per operation. Although our practical construction requires an asymptotically linear amount
of client-side storage, the constant is so small (0.01% to 0.3% of the O-RAM capacity) that in
realistic settings, the amount of client-side storage is comparable to $\sqrt{N}$.

Theoretical construction. By applying recursion to the practical construction, we can reduce
the client-side storage to a sublinear amount, and obtain a novel construction of theoretic interest
that achieves $O((\log N)^2)$ amortized and worst-case cost and requires $O(\sqrt{N})$ client-side storage and
$O(N)$ server-side storage. Note that in the $O((\log N)^2)$ asymptotic notation, one of the $\log N$
factors stems from the depth of the recursion; in realistic settings (see Table 2), the depth
of the recursion is typically 2 or 3.

Table 1 summarizes our contributions in the context of related work. Table 2 provides suggested
parametrizations for our practical construction.
1.2 Main Technique: Partitioning



We propose a novel partitioning technique, which is the key to achieving the claimed theoretical
bounds as well as major practical savings. The basic idea is to partition a single O-RAM of size
$N$ blocks into $P$ different O-RAMs of size roughly $N/P$ blocks each. This allows us to break down a
bigger O-RAM into multiple smaller O-RAMs.

The idea of partitioning is motivated by the fact that the major source of overhead in existing
O-RAM constructions arises from an expensive remote oblivious sorting protocol performed between
the client and the server. Because the oblivious sorting protocol can take up to $O(N)$ time, existing
O-RAM schemes require $\Omega(N)$ time in the worst case or have unreasonable $O(\sqrt{N})$ amortized cost.

We partition the Oblivious RAM into roughly $P = \sqrt{N}$ partitions, each having approximately
$\sqrt{N}$ blocks. This way, the client can use $\sqrt{N}$ blocks of storage to sort/reshuffle the data blocks
locally, and then simply transfer the reshuffled data blocks to the server. This not only circumvents
the need for the expensive oblivious sorting protocol, but also allows us to achieve $O(\sqrt{N})$
worst-case cost. Furthermore, by allowing reshuffling to happen concurrently with reads, we can further
reduce the worst-case cost of the practical construction to $O(\log N)$.

While the idea of partitioning is attractive, it also brings along an important challenge in 
terms of security. Partitioning creates an extra channel through which the data access pattern can 
potentially be inferred by observing the sequence of partitions accessed. Therefore, we must take 
care to ensure that the sequence of partitions accessed does not leak information about the identities 
of blocks being accessed. Specifically, our construction ensures that the sequence of partitions 
accessed appears pseudo-random to an untrusted server. 

It is worth noting that Ostrovsky and Shoup [13] also came up with a technique to spread the
reshuffling work of the hierarchical solution [5] over time, thereby achieving poly-logarithmic
worst-case cost. However, our technique of achieving poly-logarithmic worst-case cost is fundamentally
different from that of Ostrovsky and Shoup [13]. Moreover, our partitioning and background eviction techniques
are also key to the practical performance gain that we can achieve.



1.3 Related Work 



Oblivious RAM was first investigated by Goldreich and Ostrovsky [4, 5, 12]. Since their original
work, several seminal improvements have been proposed [6, 14, 16, 17]. These approaches mainly
fall into two broad categories: constructions that use $O(1)$ client-side storage, and constructions
that use $O(N^a)$ client-side storage where $0 < a < 1$.

Williams and Sion [16] propose an O-RAM construction that requires $O(\sqrt{N})$ client-side storage
and achieves an amortized cost of $O((\log N)^2)$. Williams et al. [17] propose another construction
that uses $O(\sqrt{N})$ client-side storage and achieves $O(\log N \log\log N)$ amortized cost; however,
researchers have expressed concerns over the assumptions used in their original analysis [6, 14]. A
corrected analysis of this construction can be found in an appendix in a recent work by Pinkas and
Reinman [14].

Pinkas and Reinman [14] discovered an O-RAM construction that achieves $O((\log N)^2)$ overhead
with $O(1)$ client-side storage. However, some researchers have observed a security flaw in the
Pinkas-Reinman construction: the lookups can reveal, with considerable probability,
whether the client is searching for blocks that exist in the hash table [6]. The authors of that paper
will fix this issue in a future journal version. While Table 1 shows the overhead of the
Pinkas-Reinman scheme as is, the overhead of the scheme is likely to increase after fixing this security
flaw.

In an elegant work, Goodrich and Mitzenmacher [6] proposed a novel O-RAM construction
which achieves $O((\log N)^2)$ amortized cost with $O(1)$ client-side storage, or $O(\log N)$ amortized
cost with $O(N^a)$ client-side storage where $0 < a < 1$. The Goodrich-Mitzenmacher construction
achieves the best asymptotic performance among all known constructions. However, its practical
performance is still prohibitive. For example, with $O(\sqrt{N})$ client-side storage, its amortized cost
is > 1,400X by a very conservative estimate. In reality, the overhead could be higher.

In an independent and concurrent work, Boneh, Mazieres, and Popa [2] propose a
construction that can support up to $O(\sqrt{N})$ reads while shuffling (using $O(\sqrt{N})$ client-side storage).
The scheme achieves $O(\log N)$ online cost, and $O(\sqrt{N})$ amortized cost.

Almost all prior constructions have $\Omega(N)$ worst-case cost, except the seminal work by Ostrovsky
and Shoup [13], in which they demonstrate how to spread the reshuffling operations of the
hierarchical construction [5] across time to achieve poly-logarithmic worst-case cost. While the
aforementioned concurrent work by Boneh et al. [2] alleviates this problem by separating the cost
into an online part for reading and writing data, and an offline part for reshuffling, they do so at
an increased amortized cost of $O(\sqrt{N})$ (when their scheme is configured with $O(\sqrt{N})$ client-side
storage). In addition, if $\Omega(\sqrt{N})$ consecutive requests take place within a small time window (e.g.,
during peak usage times), their scheme can still block on a reshuffling operation of $\Omega(N)$ cost.

Concurrent and subsequent work. In concurrent/subsequent work, Goodrich et al. [7] invented
an O-RAM scheme achieving $O((\log N)^2)$ worst-case cost with $O(1)$ memory, and Kushilevitz
et al. [10] invented a scheme with $O\left(\frac{(\log N)^2}{\log\log N}\right)$ worst-case cost. Goodrich et al. also came up with a
stateless Oblivious RAM [8] scheme, with $O(\log N)$ amortized cost and $O(N^a)$ ($0 < a < 1$)
client-side transient (as opposed to permanent) buffers. Due to larger constants in their constructions,
our construction is two to three orders of magnitude more efficient in realistic settings.



2 Problem Definition 

As shown in Figure 1, we consider a client that wishes to store data at a remote untrusted server
while preserving its privacy. While traditional encryption schemes can provide confidentiality, they
do not hide the data access pattern, which can reveal very sensitive information to the untrusted
server. We assume that the server is untrusted, and the client is trusted, including the client's CPU
and memory hierarchy (including RAM and disk).

The goal of O-RAM is to completely hide the data access pattern (which blocks were read/written) 
from the server. In other words, each data read or write request will generate a completely random 
sequence of data accesses from the server's perspective. 

Notations. We assume that data is fetched and stored in atomic units, referred to as blocks,
of size $B$ bytes each. For example, a typical value of $B$ for cloud storage is 64 KB to 256 KB.
Throughout the paper, we use the notation $N$ to denote the total number of data blocks that the
O-RAM can support, also referred to as the capacity of the O-RAM.

Practical considerations. One of our goals is to design a practical Oblivious RAM scheme in 
realistic settings. We observe that bandwidth is much more costly than computation and storage 
in real-world scenarios. For example, typical off-the-shelf PCs and laptops today have gigabytes 
of RAM, and several hundred gigabytes of disk storage. When deploying O-RAM in a realistic 
setting, it is very likely that the bottleneck is network bandwidth and latency. 

As a result, our practical Oblivious RAM construction leverages available client-side storage as
a working buffer, and this allows us to drastically optimize the bandwidth consumption between
the server and the client. As a typical scenario, we assume that the client wishes to store terabytes
of data on the remote server, and the client has megabytes to gigabytes of storage (in the form of
RAM or disk). We wish to design a scheme in which the client can maximally leverage its local
storage to reduce the overhead of O-RAM.

Figure 1: Oblivious RAM system architecture.

Figure 2: The partitioning framework.

Security definitions. We adopt the standard security definition for O-RAMs. Intuitively, the
security definition requires that the server learns nothing about the access pattern. In other words,
no information should be leaked about: 1) which data is being accessed; 2) how old it is (when
it was last accessed); 3) whether the same data is being accessed (linkability); 4) the access pattern
(sequential, random, etc.); or 5) whether the access is a read or a write. Like previous work, our
O-RAM constructions do not consider information leakage through the timing channel, such as
when or how frequently the client makes data requests.

Definition 1 (Security definition). Let $y := ((\mathrm{op}_1, u_1, \mathrm{data}_1), (\mathrm{op}_2, u_2, \mathrm{data}_2), \ldots, (\mathrm{op}_M, u_M, \mathrm{data}_M))$
denote a data request sequence of length $M$, where each $\mathrm{op}_i$ denotes a read($u_i$) or a write($u_i$, data)
operation. Specifically, $u_i$ denotes the identifier of the block being read or written, and $\mathrm{data}_i$ denotes
the data being written. Let $A(y)$ denote the (possibly randomized) sequence of accesses to the remote
storage given the sequence of data requests $y$. An O-RAM construction is said to be secure if for
any two data request sequences $y$ and $z$ of the same length, their access patterns $A(y)$ and $A(z)$ are
computationally indistinguishable by anyone but the client.



3 The Partitioning Framework 

In this section, we describe our main technique, partitioning, as a framework. At a high level, the 
goal of partitioning is to subdivide the O-RAM into much smaller partitions, so that the operations 
performed on the partitions can be handled much more efficiently than if the O-RAM was not 
partitioned. 

The main challenge of partitioning the O-RAM is to ensure that the sequence of partitions 
accessed during the lifetime of the O-RAM appears random to the untrusted server while keeping 
the client-side storage small. In this way, no information about data access pattern is revealed. 

3.1 Server Storage 

We divide the server's storage into $P$ fully functional partition O-RAMs, each containing $N/P$
blocks on average. For now, we can think of each partition O-RAM as a black box, exporting a
read and a write operation, while hiding the access patterns within that partition.
At any point of time, each block is randomly assigned to any of the $P$ partitions. Whenever a
block is accessed, the block is logically removed from its current partition (although a stale copy
of the block may remain), and logically assigned to a fresh random partition selected from all $P$
partitions. Thus, the client needs to keep track of which partition each block is associated with at
any point of time, as specified in Section 3.2.



The maximum number of blocks that an O-RAM can contain is referred to as the capacity of
the O-RAM. In our partitioning framework, blocks are randomly assigned to partitions, so the
capacity of an O-RAM partition has to be slightly more than $N/P$ blocks to accommodate the
variance of assignments. By the standard balls and bins analysis, for $P = \sqrt{N}$, each partition
needs to have capacity $\sqrt{N} + o(\sqrt{N})$ to have a sufficiently small failure probability $1/\mathrm{poly}(N)$.
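
To make the balls and bins argument concrete, the following Python sketch throws $N$ blocks uniformly at random into $P = \sqrt{N}$ partitions and reports the heaviest partition. It is only an illustrative simulation under assumed parameters (N = 2^16, a fixed seed); the function name `partition_loads` is hypothetical and the run is not a substitute for the analysis.

```python
import math
import random
from collections import Counter


def partition_loads(num_blocks: int, seed: int = 0) -> list[int]:
    """Assign each block to a uniformly random partition; return per-partition loads."""
    rng = random.Random(seed)
    num_partitions = math.isqrt(num_blocks)  # P = sqrt(N)
    counts = Counter(rng.randrange(num_partitions) for _ in range(num_blocks))
    return [counts[p] for p in range(num_partitions)]


loads = partition_loads(2**16)
mean = sum(loads) / len(loads)
print(f"partitions={len(loads)}, mean load={mean:.0f}, max load={max(loads)}")
```

The gap between the maximum and the mean load is the slack that each partition's capacity has to absorb.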



3.2 Client Storage 

The client storage is divided into the following components. 

Data cache with $P$ slots. The data cache is a cache for temporarily storing data blocks fetched
from the server. There are exactly $P$ cache slots, equal to the number of partitions on the server.
Logically, the $P$ cache slots can be thought of as an extension of the server-side partitions. Each slot
can store 0, 1, or multiple blocks. In the full version of this paper [1], we prove that each cache
slot will have a constant number of data blocks in expectation, and that the total number of data
blocks in all cache slots will be bounded by $O(P)$ with high probability. In both our theoretic
and practical constructions, we will let $P = \sqrt{N}$. In this case, the client's data cache capacity is
$O(\sqrt{N})$.

Position map. As mentioned earlier, the client needs to keep track of which partition (or cache
slot) each block resides in. The position map serves exactly this purpose. We use the notation
position[u] for the partition number where block u currently resides. In our practical construction
described in Section 4, the position map is extended to also contain the exact location (level number
and index within the level) of block u within its current partition.

Intuitively, each block's position (i.e., partition number) requires about $c \log N$ bits to describe.
In our practical construction, $c < 1.1$, since the practical construction also stores the block's exact
location inside a partition. Hence, the position map requires at most $cN \log N$ bits of storage, or
$\frac{c}{8} N \log N$ bytes, which is $\frac{cN \log N}{8B}$ blocks. Since in practice the block size $B > \frac{c}{8} \log N$ bytes, the size of
the position map is a constant fraction of the original capacity of the O-RAM (with a very small
constant).
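
As a worked instance of this estimate, the Python sketch below evaluates the $\frac{c}{8} N \log N$ byte bound for assumed example values ($N = 2^{24}$ blocks, $B$ = 64 KB taken as 2^16 bytes, and $c = 1.1$ used as the upper end of the stated range). The helper name `position_map_bytes` is hypothetical.

```python
import math


def position_map_bytes(num_blocks: int, c: float = 1.1) -> float:
    """Upper bound on the position map size: c * N * log2(N) bits, returned in bytes."""
    bits = c * num_blocks * math.log2(num_blocks)
    return bits / 8


N = 2**24           # assumed number of blocks
B = 64 * 2**10      # assumed block size in bytes
size = position_map_bytes(N)
print(f"position map ~ {size / 2**20:.1f} MiB, {size / (N * B):.6%} of capacity")
```

For these values the map's share of the O-RAM capacity is $(c \log N) / (8B)$, which stays small as long as the block size is much larger than $\log N$ bits.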

Shuffling buffer. The shuffling buffer is used for the shuffling operation when two or more levels
inside a partition O-RAM need to be merged into the next level. For this paper, we assume that
the shuffling buffer has size $O(\sqrt{N})$.

Miscellaneous. Finally, we need some client-side storage to store miscellaneous states and infor- 
mation, such as cryptographic keys for authentication, encryption, and pseudo-random permuta- 
tions. 



3.3 Intuition 

In our construction, regardless of whether a block is found in the client's data cache, the client always 
performs a read and a write operation to the server upon every data request - with a dummy read 
operation in case of a cache hit. Otherwise, the server might be able to infer the age of the blocks 
being accessed. Therefore, the client data cache is required for security rather than for efficiency. 
In some sense, the data cache acts like a holding buffer. When the client fetches a block from
the server, it cannot immediately write the block back to some partition, since this would result
in linkability attacks the next time this block is read. Instead, the fetched block is associated with a
fresh randomly chosen cache slot, but the block resides in the client data cache until a background
eviction process writes it back to the server partition corresponding to its cache slot.

Another crucial observation is that the eviction process should not reveal which client cache 
slots are filled and which are not, as this can lead to linkability attacks as well. To achieve this, we 
use an eviction process that is independent of the load of each cache slot. Therefore we sometimes 
have to write back a dummy block to the server. For example, one possible eviction algorithm is to 
sequentially scan the cache slots at a fixed rate and evict a block from it or evict a dummy block 
if the cache slot is empty. 

To aid the understanding of the partitioning framework, it helps to think of each client cache
slot i as an extension of the i-th server partition. At any point of time, a data block is associated
with a random partition (or slot), and the client has a position map to keep track of the location
of each block. If a block is associated with partition (or slot) $i \in [P]$, it means that an up-to-date
version of the block currently resides in partition i (or cache slot i). However, it is possible that
other partitions (or even the same partition) may still carry a stale version of the block, which will
be removed during a future reshuffling operation.

Every time a read or write operation is performed on a block, the block is re-assigned to a 
partition (or slot) selected independently at random from all P partitions (or slots). This ensures 
that two operations on the same block cannot be linked to each other. 

3.4 Setup 

When the construction is initialized, we first assign each block to an independently random parti- 
tion. Since initially all blocks are zeroed, their values are implicit and we don't write them to the 
server. The position map stores an additional bit per block to indicate if it has never been accessed 
and is hence zeroed. In the practical construction, this bit can be implicitly calculated from other 
metadata that the client stores. Additionally, the data cache is initially empty. 

3.5 Partition O-RAM Semantics and Notations 

Before we present the main operations of our partitioning-based O-RAM, we first need to define 
the operations supported by each partition O-RAM. 

Recall that each partition is a fully functional O-RAM by itself. To understand our partitioning
framework, it helps to think of each partition as a blackbox O-RAM. For example, for each partition,
we can plug in the Goodrich-Mitzenmacher O-RAM [6] (with either $O(1)$ or $O(\sqrt{N})$ client-side
storage) or our own partition O-RAM construction described in Section 4.

We make a few small assumptions about the partition O-RAM, and use slightly different
semantics to refer to the partition O-RAM operations than existing work. Existing O-RAM
constructions [4, 6, 14] always perform both a read and a write operation upon any data access request.
For the purpose of the partitioning framework, it helps to separate the reads from the writes. In
particular, we require that a ReadPartition operation "logically remove" the fetched block from the
corresponding partition. Many existing constructions [4, 6, 14] can be easily modified to support this
operation, simply by writing back a dummy block to the first level of the hierarchy after reading.

Formally, we think of each partition O-RAM as a blackbox O-RAM exporting two operations, 
ReadPartition and WritePartition, as explained below. 
    12: slot[r] $\leftarrow$ slot[r] $\cup$ {(u, data)}
    13: Call Evict(p)   /* Optional eviction piggy-backed on normal data access requests. Can improve performance by a constant factor. */
    14: Call SequentialEvict($\nu$) or RandomEvict($\nu$)
    15: return data



Figure 3: Algorithm for data access. Read or write a data block identified by u. If op = read, the input 
parameter data* = None, and the Access operation returns the newly fetched block. If op = write, the Access 
operation writes the specified data* to the block identified by u, and returns the old value of the block u. 



• ReadPartition(p, u) reads a block identified by its unique identifier $u \in \{\bot, 1, 2, \ldots, N-1\}$
from partition p. In case $u = \bot$, the read operation is called a dummy read. We assume that
the ReadPartition operation will logically remove the fetched block from the corresponding
partition.

• WritePartition(p, u, data) writes back a block identified by its unique identifier $u \in \{\bot, 1, 2, \ldots, N-1\}$
to partition p. In case $u = \bot$, the write operation is called a dummy write. The parameter
data denotes the block's data.

Remark 1 (About the dummy block identifier $\bot$). The dummy block identifier $\bot$ represents a
meaningless data block. It is used as a substitute for a real block when the client does not want the
server to know that there is no real block for some operation.

Remark 2 (Block identifier space). Another weak assumption we make is that each partition
O-RAM needs to support non-contiguous block identifiers. In other words, the block identifiers need
not be numbers within $[1, N]$, where $N$ is the O-RAM capacity. Most existing schemes [14]
satisfy this property.
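
The semantics above can be illustrated with a small stand-in. The Python sketch below is deliberately not oblivious: it only mimics the interface, namely that a real read logically removes the block, that dummy operations carry no real block, and that identifiers need not be contiguous. The class name `StubPartition`, the method names, and the use of `None` for the dummy identifier are assumptions made for illustration.

```python
from typing import Optional

DUMMY = None  # stands in for the dummy block identifier


class StubPartition:
    """Non-oblivious stand-in that mimics ReadPartition/WritePartition semantics only."""

    def __init__(self) -> None:
        self._blocks: dict[int, bytes] = {}  # dict keys allow non-contiguous identifiers

    def read_partition(self, u: Optional[int]) -> Optional[bytes]:
        if u is DUMMY:
            return None                    # dummy read: no real block is returned
        return self._blocks.pop(u, None)   # real read logically removes the block

    def write_partition(self, u: Optional[int], data: Optional[bytes]) -> None:
        if u is DUMMY:
            return                         # dummy write: no real block is stored
        self._blocks[u] = data


part = StubPartition()
part.write_partition(1000003, b"payload")
print(part.read_partition(1000003))  # the block is returned once...
print(part.read_partition(1000003))  # ...and is gone afterwards
```

A real partition O-RAM would additionally hide, from the server, which of these calls were dummies.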



3.6 Reading a Block 

Let read(u) denote a read operation for a block identified by u. The client looks it up in the position 
map, and finds out which partition block u is associated with. Suppose that block u is associated 
with partition p. The client then performs the following steps: 
Step 1: Read a block from partition p. 

• If block u is found in cache slot p, the client performs a dummy read from partition p of the
server, i.e., calls ReadPartition(p, $\bot$), where $\bot$ denotes a dummy block.

• Otherwise, the client reads block u from partition p of the server by calling ReadPartition(p, u).



Step 2: Place block u that was fetched in Step 1 into the client's cache, and update the position 
map. 

• Pick a fresh random slot number s, and place block u into cache slot s. This means that block 
u is scheduled to be evicted to partition s in the future, unless another read(u) preempts the 
eviction of this block. 

• Update the position map, and associate block u with partition s. In this way, the next read(u) 
will cause partition s to be read and written. 



Afterwards, a background eviction takes place as described in Section 3.8.



3.7 Writing a Block 

Let write(u, data*) denote writing data* to the block identified by u. This operation is implemented 
as a read(u) operation with the following exception: when block u is placed in the cache during the 
read(u) operation, its data is set to data*. 

Observation 1. Each read or write operation will cause an independent, random partition to be 
accessed. 

Proof (sketch). Consider each client cache slot as an extension of the corresponding server
partition. Every time a block u is read or written, it is placed into a fresh random cache slot s, i.e.,
associated with partition s. Note that every time, s is chosen at random, independently of the
operations on the Oblivious RAM. The next time block u is accessed, regardless of whether block u
has been evicted from the cache slot before this access, the corresponding partition s is read and
written. As the value of the random variable s has not been revealed to the server before this, from
the server's perspective s is independent and uniformly random. □
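
The read and write procedures of Sections 3.6 and 3.7 can be sketched in Python as follows. This is a toy model under stated assumptions, not the paper's implementation: server partitions are plain dictionaries (so nothing here is actually oblivious), eviction is omitted, never-written blocks read back as `None` instead of zeroed data, partitions are numbered from 0, and the names `PartitionedORAM`, `access`, and `trace` are hypothetical.

```python
import random


class PartitionedORAM:
    """Toy model of the partitioning framework; server partitions are NOT oblivious."""

    def __init__(self, num_blocks, num_partitions, seed=None):
        self.rng = random.Random(seed)
        self.P = num_partitions
        # Setup: every block starts in an independently random partition.
        self.position = [self.rng.randrange(self.P) for _ in range(num_blocks)]
        self.slots = [dict() for _ in range(self.P)]   # client data cache
        self.server = [dict() for _ in range(self.P)]  # stand-in partition O-RAMs
        self.trace = []                                # partitions the server sees

    def access(self, op, u, new_data=None):
        r = self.rng.randrange(self.P)   # fresh random slot for block u
        p = self.position[u]
        self.position[u] = r
        self.trace.append(p)             # partition p is touched on hit or miss
        if u in self.slots[p]:
            data = self.slots[p].pop(u)  # cache hit: the server would get a dummy read
        else:
            data = self.server[p].pop(u, None)  # real read logically removes the block
        old = data
        if op == "write":
            data = new_data
        self.slots[r][u] = data          # held in the cache until eviction
        return old if op == "write" else data


oram = PartitionedORAM(num_blocks=8, num_partitions=4, seed=1)
oram.access("write", 3, b"hello")
print(oram.access("read", 3), oram.trace)
```

Note that the partition recorded in `trace` for an access was chosen at random during the previous access to the same block, which is the point of Observation 1.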

3.8 Background Eviction 

To prevent the client data cache from building up, blocks need to be evicted to the server at some 
point. There are two eviction processes: 

1. Piggy-backed evictions are those that take place on regular O-RAM read or write operations
(see Line 13 of Figure 3). Basically, if the data access request operates on a block currently
associated with partition p, we can piggy-back a write-back to partition p at that time.
The piggy-backed evictions are optional, but their existence can improve performance by a
constant factor.

2. Background evictions take place at a rate proportional to the data access rate (see Line 14 of
Figure 3). The background evictions are completely independent of the data access requests,
and therefore can be equivalently thought of as taking place in a separate background thread.
Our construction uses an eviction rate of $\nu > 0$, meaning that in expectation, $\nu$
background evictions are attempted with every data access request. Below are two potential
algorithms for background eviction:

(a) Sequentially scan the cache slots at a fixed rate $\nu$ (see the SequentialEvict algorithm in
Figure 4);

(b) At a fixed rate $\nu$, randomly select a slot from all $P$ slots to evict from (a modified
version of the random eviction algorithm is presented as RandomEvict in Figure 4).

Our eviction algorithm is designed to deal with two main challenges: 
Evict(p):
    if len(slot[p]) = 0 then
        WritePartition(p, $\bot$, None)
    else
        (u, data) $\leftarrow$ slot[p].pop()
        WritePartition(p, u, data)
    end if

RandomEvict($\nu$):
    for i = 1 to $\nu$ do   // Assume integer $\nu$
        r $\leftarrow$ UniformRandom(1 ... P)
        Evict(r)
    end for

SequentialEvict($\nu$):
    num $\leftarrow$ $\mathcal{D}(\nu)$   // Pick the number of blocks to evict according to distribution $\mathcal{D}$
    for i = 1 to num do
        cnt $\leftarrow$ cnt + 1   // cnt is a global counter for the sequential scan
        Evict(cnt)
    end for

Figure 4: Background eviction algorithms with eviction rate $\nu$. Here we provide two candidate
eviction algorithms, SequentialEvict and RandomEvict. SequentialEvict determines the number of blocks to evict,
num, based on a prescribed distribution $\mathcal{D}(\nu)$ and sequentially scans num slots to evict from. RandomEvict
samples $\nu \in \mathbb{N}$ random slots (with replacement) to evict from. In both SequentialEvict and RandomEvict,
if a slot selected for eviction is empty, a dummy block is evicted for the sake of security.



• Bounding the cache size. To prevent the client's data cache from building up indefinitely, the
above two eviction processes combined evict blocks at least as fast as blocks are placed into
the cache. The actual size of the client's data cache depends on the choice of the background
eviction rate $\nu$. We choose $\nu > 0$ to be a constant factor of the actual data request rate. For
our practical construction, in Section 5.1 we empirically demonstrate the relationship between $\nu$
and the cache size. In the full version of this paper [1], we prove that our background eviction
algorithm results in a cache size of $O(P)$.

• Privacy. It is important to ensure that the background eviction process does not reveal
whether a cache slot is filled or the number of blocks in a slot. For this reason, if an empty
slot is selected for eviction, a dummy block is evicted to hide the fact that the cache slot does
not contain any real blocks.

Observation 2. By design, the background eviction process generates a partition access sequence 
independent of the data access pattern. 

Lemma 1 (Partition access sequence reveals nothing about the data request sequence). Let $y$
denote a data request sequence. Let $f(y)$ denote the sequence of partition numbers accessed given
data request sequence $y$. Then, for any two data request sequences $y$ and $z$ of the same length, $f(y)$
and $f(z)$ are identically distributed. In other words, the sequence of partition numbers accessed
during the life-time of the O-RAM does not leak any information about the data access pattern.

Proof. The sequence of partition numbers is generated in two ways: 1) the regular read or write
operations, and 2) the background eviction processes. By Observations 1 and 2, both of the
above processes generate a sequence of partition numbers completely independent of the data access
pattern. □

ReadPartition(p, u):
1: L ← number of levels
2: for ℓ = 0, 1, ..., L − 1 (in parallel) do
3:     if level ℓ of partition p is not filled then
4:         continue // skip empty levels
5:     end if
6:     if block u is in partition p, level ℓ then
7:         i = position[u].index
8:     else
9:         i = nextDummy[p, ℓ]
10:        nextDummy[p, ℓ] ← nextDummy[p, ℓ] + 1
11:    end if
12:    i' = PRP(K[p, ℓ], i)
13:    Fetch from the server the block in partition p, level ℓ, and offset i'.
14:    Decrypt the block with the key K[p, ℓ].
15: end for



Figure 5: The ReadPartition operation of our practical construction that reads the block with id u
from partition p.

Figure 6: WritePartition leads to the shuffling of consecutively filled levels into the first empty
level. (The figure depicts partition p before and after the operation.)



Theorem 1. If each partition uses a secure O-RAM construction, then the new O-RAM construction
obtained by applying the partitioning framework over $P$ partition O-RAMs is also secure.

Proof. Straightforward conclusion from Lemma 1 and the security of the partition O-RAM. □

3.9 Algorithm Pseudo-code

Figures 3 and 4 describe in formal pseudo-code our Oblivious RAM operations based on the
partitioning framework. For ease of presentation, in Figure 3 we unify read and write operations into
an Access(op, u, data*) operation.

4 Practical Construction 

In this section, we apply the partitioning techniques mentioned in the previous section to obtain a
practical construction with an amortized cost of 20 ~ 35X overhead under typical settings, about
63 times faster than the best known construction. The client storage is typically 0.01% to 0.3% of
the O-RAM capacity. The worst-case cost of this construction is $O(\sqrt{N})$, and we will later show
how to allow concurrent shuffling and reads to reduce the worst-case cost to $O(\log N)$ (Section 6).

While the practical construction requires $cN$ client-side storage, the constant $c$ is so small that
our $cN$ is smaller than or comparable to $\sqrt{N}$ for typical storage sizes ranging from gigabytes to
terabytes. For the sake of theoretic interest, in Appendix 7 we show how to recursively apply our
Oblivious RAM construction to part of the client-side storage, and reduce the client-side storage
to $O(\sqrt{N})$, while incurring only a logarithmic factor in the amortized cost.



4.1 Overview 

Our practical construction uses the partitioning framework (Section 3). For the partitions, we use
our own highly optimized O-RAM construction resembling the Pinkas-Reinman O-RAM at a very
high level [14].



Choice of parameters. In this practical construction, we choose $P = \sqrt{N}$ partitions, each with
$\sqrt{N}$ blocks on average. We use the SequentialEvict algorithm as the background eviction algorithm.
Every time SequentialEvict is invoked with an eviction rate of $v$, it decides the number of blocks
to evict, num, based on a bounded geometric distribution with mean $v$. That is, let $c$ be a constant
representing the maximum number of evictions per data access operation; then
$\Pr[\mathrm{num} = k] \propto p^k$ for $0 \le k \le c$, and $\Pr[\mathrm{num} = k] = 0$ for $k > c$. Here $0 < p < 1$ is a
probability dependent on $v$ and $c$.
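
As an illustration of the bounded geometric distribution described above, the following Python sketch builds the probability mass function, solves numerically for the parameter `p` that yields a requested mean, and draws a sample. The function names and the bisection approach are assumptions made for this example; the source does not say how `p` is derived from the eviction rate and `c`. The search only works for a mean strictly between 0 and c/2, since the mean approaches c/2 as `p` approaches 1.

```python
import random


def bounded_geometric_pmf(p, c):
    """Return [Pr[num = 0], ..., Pr[num = c]] with Pr[num = k] proportional to p**k."""
    weights = [p ** k for k in range(c + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean(pmf):
    """Expected value of a distribution given as a list indexed by outcome."""
    return sum(k * q for k, q in enumerate(pmf))


def solve_p(v, c, tol=1e-12):
    """Bisect for p in (0, 1) so that the distribution has mean v (needs 0 < v < c / 2)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(bounded_geometric_pmf(mid, c)) < v:
            lo = mid      # mean grows with p, so p must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def sample_num(p, c, rng=random):
    """Draw the number of background evictions for one data access."""
    return rng.choices(range(c + 1), weights=bounded_geometric_pmf(p, c))[0]
```

Capping the support at `c` bounds the work done per data access, while the mean controls the long-run eviction rate.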

As mentioned earlier, the piggybacked evictions enable practical savings up to a constant factor, 
so we use piggybacked evictions in this construction. 

Optimized partition O-RAM construction. While any existing O-RAM construction satisfying
the modified ReadPartition and WritePartition semantics can be used as a partition O-RAM, we
propose our own highly optimized partition O-RAM. Our partition O-RAM construction resembles
the Pinkas-Reinman O-RAM at a very high level [14], but with several optimizations to gain
practical savings. The practical savings come from at least three sources, in comparison with the
Pinkas-Reinman construction:

• Local sorting. Due to our partitioning framework, each partition is now of size $O(\sqrt{N})$ blocks.
This allows us to use a client shuffling buffer of size $O(\sqrt{N})$ blocks to reshuffle the partition
locally, thereby eliminating the need for extremely expensive oblivious sorting procedures
during a reshuffling operation. This is our most significant source of saving in comparison
with all other existing schemes.

• No Cuckoo hashing. Second, since we use a position map to save the locations of all blocks,
we no longer need Cuckoo hashing, thereby saving a 2X factor for lookups.

• Compressed data transfer during reshuffling. Third, during the reshuffling operation, the
client only reads blocks from each level that have not been previously read. Also, when the
client writes back a set of shuffled blocks to the server (at least half of which are dummy
blocks), it uses a compression algorithm to compress the shuffling buffer down to half its size.
These two optimizations save about another 2X factor.

• Latency reduction. In the practical construction, the client saves a position map which records
the location of each block on the server. This allows the client to query the $O(\log N)$ levels
in each partition in a single round-trip, thereby reducing the latency to $O(1)$.

4.2 Partition Layout 

As mentioned earlier, we choose $P = \sqrt{N}$ partitions for the practical construction. Each partition
consists of $L = \log_2(\sqrt{N}) + 1 = \frac{1}{2}\log_2 N + 1$ levels, indexed by $0, 1, \ldots, \frac{1}{2}\log_2 N$
respectively. Except for the top level, each level $\ell$ has $2 \cdot 2^\ell$ blocks, among which at most half
are real blocks, and the rest (at least half) are dummy blocks.

The top level, where $\ell = \frac{1}{2}\log_2 N$, has $2 \cdot 2^\ell + \epsilon = 2\sqrt{N} + \epsilon$ blocks, where the surplus
$\epsilon$ is due to the fact that some partitions may have more blocks than others when the blocks are
assigned in a random fashion to the partitions. Due to a standard balls and bins argument [15], each
partition's maximum size (including real and dummy blocks) should be $4\sqrt{N} + o(\sqrt{N})$ such that
the failure probability is $\frac{1}{\mathrm{poly}(N)}$. In Appendix 5.3, we empirically demonstrate that in practice,
the maximum number of real blocks in each partition is not more than $1.15\sqrt{N}$ for $N > 20$, hence the
partition capacity is no more than $4 \cdot 1.15\sqrt{N} = 4.6\sqrt{N}$ blocks, and the total server storage is no
more than $4.6N$ blocks. In Appendix 8.2, we propose an optimization to reduce the server storage to
less than $3.2N$ blocks.

At any given time, a partition on the server might have some of its levels filled with blocks
and others unfilled. The top level is always filled. Also, a data block can be located in any
partition, any filled level, and any offset within the level. In the practical construction, we extend
the position map of the partition framework to also keep track of the level number and offset of
each block.

From the perspective of the server, all blocks within a level are pseudo-randomly arranged.
Because the blocks are encrypted, the server cannot even tell which blocks are real and which ones
are dummy. We use a keyed pseudo-random permutation (PRP) function for permuting blocks within
a level in our construction. When the context is clear, we omit the range of the PRP function in
the pseudo-code.
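
The level structure of this section can be checked with a few lines of Python. The sketch below is only illustrative: the function name is hypothetical, it assumes $N$ is an even power of two so that $\sqrt{N}$ and $\frac{1}{2}\log_2 N$ are integers, and it leaves out the top-level surplus $\epsilon$.

```python
import math


def partition_layout(n_blocks):
    """Return (P, L, level_sizes) for N = n_blocks, assuming N is an even power of two."""
    log_n = n_blocks.bit_length() - 1
    assert n_blocks == 1 << log_n and log_n % 2 == 0, "sketch assumes N = 4**j"
    num_partitions = math.isqrt(n_blocks)          # P = sqrt(N)
    num_levels = log_n // 2 + 1                    # L = (1/2) log2 N + 1
    # Level l holds 2 * 2**l blocks; the top level's surplus epsilon is omitted here.
    level_sizes = [2 * 2 ** level for level in range(num_levels)]
    return num_partitions, num_levels, level_sizes
```

For example, with `N = 2**20` this gives 1024 partitions of 11 levels each, and the level sizes sum to `4 * 1024 - 2` blocks, just under the $4\sqrt{N}$ leading term of the bound above.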

4.3 Setup 

The initial set of filled levels that contain the blocks depends on the partition number $p$. In order
to better amortize the reshuffling costs of our scheme, the client randomly chooses which levels of
each partition will be initially filled (with the restriction that the top level is always filled). Note
that there are $2^{L-1}$ such possible fillings of a partition, where $L$ is the number of levels in the
partition. The client notifies the server which levels are filled but does not write the actual blocks
to the server because the blocks are initially zeroed and their values can be calculated implicitly
by storing one bit for each level of each partition. This bit indicates if the entire level has never
been reshuffled and is hence zeroed.

4.4 Reading from a Partition 

The ReadPartition operation reads the block with id $u$ from partition $p$ as described in Figure 5. If
$u = \bot$, then the ReadPartition operation is a dummy read and a dummy block is read from each
filled level. If $u \neq \bot$, block $u$ is read from the level that contains it, and a dummy block is read
from each of the other filled levels. Note that all of the fetches from the server are performed
in parallel and hence this operation has single round trip latency, unlike existing schemes [4, 5, 12]
which take $\Omega(\log N)$ round trips.
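
A minimal Python sketch of how the client could compute the offsets to fetch in one ReadPartition is given below. It follows the control flow of Figure 5 but rests on several assumptions: `toy_prp` is a seeded shuffle standing in for a real keyed PRP and is not cryptographically secure, indices are 0-based, and the data structures (`levels`, `position`, `next_dummy`) are hypothetical simplifications of the client state.

```python
import random


def toy_prp(key, size):
    """Keyed permutation of range(size). NOT a cryptographic PRP; a seeded shuffle stands in."""
    table = list(range(size))
    random.Random(key).shuffle(table)
    return table


def read_partition_offsets(levels, position, next_dummy, p, u):
    """Return {level: offset} to fetch for one ReadPartition(p, u); u=None is a dummy read.

    levels[l] is None for an unfilled level, else a dict with 'key' and 'size'.
    position maps block id -> (partition, level, index); next_dummy maps level -> index.
    """
    offsets = {}
    for level, info in enumerate(levels):
        if info is None:
            continue                                  # skip empty levels
        if u is not None and position.get(u, (None, None, None))[:2] == (p, level):
            i = position[u][2]                        # real block lives in this level
        else:
            i = next_dummy[level]                     # otherwise consume a fresh dummy
            next_dummy[level] += 1
        offsets[level] = toy_prp(info["key"], info["size"])[i]
    return offsets
```

Because every filled level contributes exactly one offset regardless of where the block lives, all fetches can be issued together in a single round trip.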

4.5 Writing to a Partition 

Each write to a partition is essentially a reshuffling operation performed on consecutively filled levels 
in a partition. Therefore, we sometimes use the terms "write" and "shuffling" interchangeably. 
First, unread blocks from consecutively filled levels of the partition are read from the server into 
the client's shuffling buffer. Then, the client permutes the shuffling buffer according to a pseudo- 
random permutation (PRP) function. Finally, the client uploads its shuffling buffer into the first 
unfilled level and marks all of the levels below it as unfilled. The detailed pseudo-code for the
WritePartition operation is given in Figure 7.

There is an exception when all levels of a partition are filled. In that case, the reshuffling 
operation is performed on all levels, but at the end, the top level (which was already filled) is 
overwritten with the contents of the shuffling buffer and the remaining levels are marked as unfilled. 
Note that the shuffling buffer is never bigger than the top level because only unread real (not dummy) 
blocks are placed into the shuffling buffer before it is padded with dummy blocks. Since the top 
level is big enough to contain all of the real items inside a partition, it can hold the entire shuffling 
buffer. 
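
The effect of successive writes on the set of filled levels resembles incrementing a binary counter that saturates at the top level. The following Python sketch tracks only the filled flags of one partition (no blocks are moved); the function name and the list-of-booleans representation are assumptions made for illustration.

```python
def write_partition_levels(filled):
    """Update the filled flags of one partition for a single WritePartition.

    filled[l] is True when level l is filled. Returns the level written to.
    """
    top = len(filled) - 1
    target = 0
    while target < top and filled[target]:
        target += 1                    # first unfilled level, capped at the top level
    for level in range(target):
        filled[level] = False          # consecutively filled levels below are emptied
    filled[target] = True
    return target
```

Starting from a three-level partition with only the top level filled, four consecutive writes land in levels 0, 1, 0 and 2; the last one is the exceptional case in which all levels are filled and the top level is overwritten.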

During a reshuffling operation, the client uses the pseudo-random permutation PRP to determine
the offset of all blocks (real and dummy) within a level on the server. Every time blocks are shuffled
and written into the next level, the client generates a fresh random PRP key $K[p, \ell]$ so that blocks
end up at random offsets every time that level is constructed. The client remembers the keys for
all levels of all partitions in its local cache.

Reading levels during shuffling. When the client reads a partition's levels into the shuffling
buffer (Line 5 of Figure 7), it reads exactly $2^\ell$ previously unread blocks from level $\ell$. Unread
blocks are those that were written during a WritePartition operation when the level was last
constructed, but have not been read by a ReadPartition operation since then. The client only needs to
read the unread blocks because the read blocks were already logically removed from the partition when
they were read. There is a further restriction that among those $2^\ell$ blocks must be all of the unread
real (non-dummy) blocks. Since a level contains up to $2^\ell$ real blocks, and there are always at least
$2^\ell$ unread blocks in a level, this is always possible.

The client can compute which blocks have been read/unread for each level. It does this by first 
fetching from the server a small amount of metadata for the level that contains the list of all blocks 
(read and unread) that were in the level when it was last filled. Then the client looks up each of 
those blocks in the position map to determine if the most recent version of that block is still in 
this level. Hence, the client can obtain the list of unread real blocks. The offsets of the unread
dummy blocks can be easily obtained by repeatedly applying the PRP function to nextDummy and
incrementing nextDummy. Note that for security, the offsets of the $2^\ell$ unread blocks must be first
computed and then the blocks must be read in order of their offset (or some other order independent
of which blocks are real/dummy).
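
The selection rule above can be sketched as follows. This Python fragment assumes the client has already computed the offsets of the unread real and unread dummy blocks of a level; the function name and argument layout are hypothetical. It pads the real blocks with dummies up to $2^\ell$ offsets and returns them in ascending offset order, one of the orders that does not depend on which blocks are real.

```python
def choose_blocks_to_read(level, unread_real_offsets, unread_dummy_offsets):
    """Pick exactly 2**level unread offsets that include every unread real block.

    Returns offsets sorted ascending so the read order is independent of
    which blocks are real and which are dummy.
    """
    want = 2 ** level
    if len(unread_real_offsets) > want:
        raise ValueError("a level holds at most 2**level real blocks")
    padding = want - len(unread_real_offsets)
    if padding > len(unread_dummy_offsets):
        raise ValueError("not enough unread dummy blocks to pad with")
    chosen = list(unread_real_offsets) + list(unread_dummy_offsets)[:padding]
    return sorted(chosen)
```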

4.6 Security 

Our practical construction has the following security guarantees: obliviousness (privacy),
confidentiality, and authentication (with freshness).

Theorem 2. The practical construction is oblivious according to Definition 1.

Proof. The proof is presented in the full version of the paper [1]. □

Theorem 3. The practical construction provides confidentiality and authentication (with freshness)
of all data stored on the server.

Proof. The proof is presented in the full version of the paper [1]. □

It should be noted that although our partition O-RAM construction resembles the Pinkas-Reinman
construction, it does not have the security flaw discovered by Goodrich and Mitzenmacher [6]
because it does not use Cuckoo hash tables.

5 Experimental Results 

For our experiments, we implemented a simulator of our construction. Each read/write operation 
is simulated and the simulator keeps track of exactly where each block is located, the amount of 
client side storage used, and the total bytes transferred for all communication between the client 
and server. We also implemented a simulator for the best previously known O-RAM scheme for 
comparison. 

For each parametrization of our Oblivious RAM construction, we simulated exactly $3N$ read/write
operations. For example, for each O-RAM instance with $N = 2^{28}$ blocks, we simulated about 800
million operations. We used a round-robin access pattern which maximizes the size of the client's
data cache of our construction by maximizing the probability of a cache miss. Therefore our results
always show the worst case cache size. Also, because our construction is oblivious, our amortized
cost measurements are independent of the simulated access pattern. We used the level compression,
server storage reduction, and piggy-backed eviction optimizations described in Appendix 8.

WritePartition(p, u*, data*):
1: // Read consecutively filled levels into the client's shuffling buffer denoted sbuffer.
2: ℓ₀ ← last consecutively filled level
3: for ℓ = 0 to ℓ₀ do
4:     Fetch the metadata (list of block IDs) for level ℓ in partition p. Decrypt with key K[p, ℓ].
5:     Fetch exactly 2^ℓ previously unread blocks from level ℓ into sbuffer such that all unread real
       blocks are among them. Decrypt everything with the key K[p, ℓ]. Ignore dummy blocks when
       they arrive.
6:     Mark level ℓ in partition p as unfilled.
7: end for
8: ℓ = min(ℓ₀ + 1, L − 1) // Don't spill above the top level.
9: Add the (u*, data*) to sbuffer.
10: k ← number of real blocks in sbuffer.
11: for i = 1 to k do
12:     Let (u, data) = sbuffer[i]
13:     position[u] ← {p, ℓ, i} // update position map
14: end for
15: K[p, ℓ] ←R 𝒦 // generate fresh key for level ℓ in partition p
16: Pad the shuffling buffer with dummy blocks up to length 2 · 2^ℓ. // The first k blocks are real and
    the rest are dummy.
17: Permute sbuffer with PRP_ℓ(K[p, ℓ], ·). A block originally at index i in the shuffling buffer is now
    located at offset i' in the shuffling buffer, where i' = PRP_ℓ(K[p, ℓ], i).
18: Write the shuffling buffer into level ℓ in partition p on the server, encrypted with key K[p, ℓ].
19: Write the metadata (list of block IDs) of level ℓ in partition p to the server, encrypted with key K[p, ℓ].
20: Mark level ℓ in partition p as filled.
21: nextDummy[p, ℓ] ← k + 1 // initialize counter to first dummy block.

Notations:
K[p, ℓ]: Secret key for partition p, level ℓ. (PRP or AES key depending on context)
nextDummy[p, ℓ]: Index of next unread dummy block for partition p, level ℓ.
{p, ℓ, i} ← position[u]: Position information for block u (partition p, level ℓ, index i within the level).

Figure 7: The WritePartition operation of our practical construction that writes block u* with data* to
partition p.

5.1 Client Storage and Bandwidth 

In this experiment, we measure the performance overhead (or bandwidth overhead) of our O-RAM. 
An O-RAM scheme with a bandwidth overhead of $w$ transfers $w$ times as much data as an
unsecured remote storage protocol. In the experiments we ignore the metadata needed to store and
fetch blocks because in practice it is much smaller than the block size. For example, we may have 
256 KB blocks, but the metadata will be only a few bytes. 

In our scheme, the bandwidth overhead depends on the background eviction rate, and the
background eviction rate determines the client's cache size. The client is free to choose its cache
size by using the appropriate eviction rate. Figure 8 shows the correlation between the background
eviction rate and cache size as measured in our simulation.

Figure 8: Background Eviction Rate vs. Cache Capacity. The x-axis is the eviction rate,
defined as the ratio of background evictions over real data requests. For example, an eviction rate
of 1 suggests an equal rate of data requests and background evictions. The y-axis is the quantity
$(\text{cache capacity})/\sqrt{N}$, where cache capacity is the maximum number of data blocks in the cache over
the course of time.




Figure 10: Partition capacity. The y-axis is the quantity $(\text{partition capacity})/\sqrt{N}$, where
partition capacity is the maximum number of real data blocks that the partition must be able to hold,
i.e., the maximum number of real data blocks inside a partition over time.




Figure 9: Trade-off between Client Storage and Bandwidth. The plot shows what bandwidth
overhead a client can achieve by using exactly $k\sqrt{N}B$ bytes of client storage for different values of
$k$ (horizontal axis). The client storage includes the cache, sorting buffer, and an uncompressed
position map. A block size of 256 KB was assumed.

Figure 11: Comparison between our construction and the best known previous O-RAM
construction. A 1 TB O-RAM is considered with both constructions using exactly $4\sqrt{N}B$
client storage. The practical performance is the number of client-server operations per O-RAM
operation. Our construction has a 63 to 66 times better performance than the best previously known
scheme for the exact same parameters.



Once the client chooses its cache size it has determined the total amount of client storage. As
previously mentioned, our scheme requires $O(\sqrt{N}B)$ bytes of client storage plus an extra $cNB$
bytes of client storage for the position map with a very small constant $c$. For most practical values of
$N$ and $B$, the position map is much smaller than the remaining $O(\sqrt{N}B)$ bytes of client storage, so
the client storage approximately scales like $O(\sqrt{N}B)$ bytes. We therefore express the total client
storage as $k\sqrt{N}B$ bytes. Then, we ask the question: How does the client's choice of $k$ affect the
bandwidth overhead of our entire O-RAM construction? Figure 9 shows this trade-off between the
total client storage and the bandwidth overhead.



5.2 Comparison with Previous Work 

To the best of our knowledge, the most practical existing O-RAM construction was developed by
Goodrich et al. [6]. It works by constructing a hierarchy of Cuckoo hash tables via MapReduce
jobs and an efficient sorting algorithm which utilizes $N^a$ ($a < 1$) blocks of client-side storage. We
implemented a simulator that estimates a lower bound on the performance of their construction.
Then we compared it to the simulation of our construction.

To be fair, we parametrized both our and their construction to use the exact same amount of
client storage: $4\sqrt{N}B$ bytes. The client storage includes all of our client data structures, including
our position map (stored uncompressed). We parametrized both constructions for exactly 1 TB
O-RAM capacity (meaning that each construction could store a total of 1 TB of blocks). We varied
the number of blocks from $N = 2^{16}$ to $N = 2^{24}$. Since the O-RAM size was fixed to 1 TB, the
block size varied between $B = 2^{24}$ bytes and $B = 2^{16}$ bytes. Figure 11 shows the results. As
can be clearly seen, our construction uses 63 to 66 times less bandwidth than the best previously
known scheme for the exact same parameters.
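
The parameter sweep above is simple arithmetic, and the following Python sketch reproduces it under stated assumptions: 1 TB is taken to mean $2^{40}$ bytes, the client storage is $k\sqrt{N}B$ bytes with $k = 4$ as in Figure 11, and the helper names are hypothetical.

```python
CAPACITY_BYTES = 2 ** 40   # 1 TB O-RAM capacity, taken here as 2**40 bytes


def block_size(log2_n):
    """Block size B in bytes when the capacity is split into N = 2**log2_n blocks."""
    return CAPACITY_BYTES >> log2_n


def client_storage_bytes(log2_n, k=4):
    """Client storage of k * sqrt(N) * B bytes; assumes log2_n is even so sqrt(N) is exact."""
    return k * (1 << (log2_n // 2)) * block_size(log2_n)
```

At the two ends of the sweep, `block_size(16)` is $2^{24}$ bytes and `block_size(24)` is $2^{16}$ bytes, matching the range quoted above.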

5.3 Partition Capacity 

Finally, we examine the effects of splitting up the Oblivious RAM into partitions. Recall that in
our practical construction with $N$ blocks, we have split up the server storage into $\sqrt{N}$ partitions
each containing about $\sqrt{N}$ blocks. Since the blocks are placed into partitions uniformly at random
rather than evenly, a partition might end up with slightly more or fewer than $\sqrt{N}$ blocks. For
security reasons, we want to hide from the server how many blocks are in each partition at any
given time, so a partition must be large enough to contain (with high probability) the maximum
number of blocks that could end up in a single partition.

Figure 10 shows how many times more blocks a partition contains than the expected number,
$\sqrt{N}$. Note that as the size of the O-RAM grows, the maximum size of a partition approaches its
expected size. In fact, one can formally show that the maximum number of real data blocks in each
partition over time is $\sqrt{N} + o(\sqrt{N})$ [15]. Hence, for large enough $N$, the partition capacity is less
than 5% larger than $\sqrt{N}$ blocks.
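
The balls-and-bins effect can be illustrated with a small Python experiment. The sketch below is not the authors' simulator: it throws $N$ blocks into $\sqrt{N}$ partitions once and reports the maximum load divided by $\sqrt{N}$, whereas the quantity in Figure 10 is a maximum over the whole run. The function name is hypothetical and $N$ is assumed to be a perfect square.

```python
import math
import random
from collections import Counter


def max_partition_load_ratio(n_blocks, rng):
    """Assign n_blocks blocks to sqrt(N) partitions uniformly at random.

    Returns (maximum partition load) / sqrt(N) for this single static assignment.
    """
    num_partitions = math.isqrt(n_blocks)
    loads = Counter(rng.randrange(num_partitions) for _ in range(n_blocks))
    return max(loads.values()) / num_partitions
```

Running this for increasing `n_blocks` with a seeded `random.Random` gives a rough feel for how the ratio behaves as the O-RAM grows.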



6 Reducing the Worst-Case Cost With Concurrency

The constructions described thus far have a worst-case cost of $O(\sqrt{N})$ because a WritePartition
operation sometimes causes a reshuffling of $O(\sqrt{N})$ blocks. We reduce the worst-case cost by spreading
out expensive WritePartition operations of $O(\sqrt{N})$ cost over a long period of time, and at each time
step performing $O(\log N)$ work.

To achieve this, we allow reads and writes (i.e., reshuffling) to a partition to happen concurrently.
This way, an operation does not have to wait for previous long-running operations to complete
before executing. We introduce an amortizer which keeps track of which partitions need to be
reshuffled, and schedules $O(\log N)$ work (or $O((\log N)^2)$ for the theoretic recursive construction)
per time step. There is a slight storage cost of allowing these operations to be done in parallel, but
we will later show that concurrency does not increase the asymptotic storage and amortized costs
of our constructions.

By performing operations concurrently, we decrease the worst-case cost of the practical
construction from $O(\sqrt{N})$ to $O(\log N)$ and we reduce the worst-case cost of the recursive construction
from $O(\sqrt{N})$ to $O((\log N)^2)$. Our concurrent constructions preserve the same amortized cost as
their non-concurrent counterparts; however, in the concurrent constructions, the worst-case cost is
the same as the amortized cost. Furthermore, in the concurrent practical construction, the latency
is $O(1)$ just like the non-concurrent practical construction, as each data request requires only a
single round-trip to complete.

if op is ReadPartition(p, u) then
    Call ConcurrentReadPartition(p, u)
else if op is WritePartition(p, u, data) then
    λ ← max i for which C_p ≡ 0 mod 2^i
    β ← {(u, data)}
    if (p, λ', β') ∈ Q for some λ' and β' then
        Q ← Q − {(p, λ', β')}
        λ ← max(λ, λ'), β ← β ∪ β'
    end if
    Q ← Q ∪ {(p, λ, β)}
    C_p ← C_p + 1
end if
Perform O(log N) work from job queue Q in the form of ConcurrentWritePartition operations.

Figure 12: The Amortizer component helps reduce the worst case costs of our constructions.
It is inserted between the background eviction process and the server-side partitions as shown on the left.
The component executes one operation per time step as defined on the right.

6.1 Overview 

We reduce the worst case cost of our constructions by inserting an Amortizer component into our
system which explicitly amortizes ReadPartition and WritePartition operations as described in
Figure 12. Specifically, the Amortizer schedules a ReadPartition operation as a ConcurrentReadPartition
operation, so the read can occur while shuffling. A ReadPartition always finishes in $O(\log N)$ time.
Upon a WritePartition operation (which invokes the shuffling of a partition), the Amortizer creates
a new shuffling "job", and appends it to a queue $Q$ of jobs. The Amortizer schedules $O(\log N)$
amount of work to be done per time step for jobs in the shuffling queue.

If reads are taking place concurrently with shuffling, special care needs to be taken to avoid 
leakages through the access pattern. This will be explained in the detailed scheme description 
below. 

Terminology. To aid understanding, it helps to define the following terminology. 

• Job. A job $(p, \lambda, \beta)$ denotes a reshuffling of levels $0, \ldots, \lambda$ of partition $p$ and then writing the
blocks in $\beta$ to partition $p$ on the server.

• Job Queue $Q$. The job queue $Q$ is a FIFO list of jobs. It is also possible to remove jobs
that are not necessarily at the head of the queue for the purpose of merging them with other
jobs; however, jobs are always added at the tail.

• Partition counter. Let $C_p \in \mathbb{Z}_s$ denote a counter for partition $p$, where $s$ is the maximum
capacity of partition $p$. All operations on $C_p$ are modulo $s$.

• Work. The work of an operation is measured in terms of the number of blocks that it reads
and writes to partitions on the server.

Handling ReadPartition operations. The amortizer performs a ReadPartition(p, u) operation as a
ConcurrentReadPartition(p, u) operation as defined in Sections 6.1.1 for the practical and recursive
constructions respectively. If block $u$ is cached by a previous ConcurrentReadPartition operation,
then it is instead read from $\beta$ where $(p, \lambda, \beta) \in Q$ for some $\lambda$ and $\beta$.

Handling WritePartition operations. The amortizer component handles a WritePartition operation
by adding it to the job queue $Q$. The job is later dequeued in some time step and processed
(possibly across multiple time steps). If the queue already has a job involving the same partition,
the existing job is merged with the new job for the current WritePartition operation. Specifically,
if one job requires shuffling levels $0, \ldots, \lambda$ and the other job requires shuffling levels $0, \ldots, \lambda'$, we
merge the two jobs into a job that requires shuffling levels $0, \ldots, \max(\lambda, \lambda')$. We also merge the
blocks to be written by both jobs.
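
The merging rule can be sketched in Python as below. This is an illustrative reading of the Amortizer's write branch, not the authors' code: the queue is modeled as an `OrderedDict` keyed by partition (so removing and re-adding a job moves it to the tail), the level index is capped at a hypothetical `top_level` because a counter value of 0 is divisible by every power of two, and the wrap-around of the partition counter modulo $s$ is omitted.

```python
from collections import OrderedDict


def max_power_of_two_exponent(counter, cap):
    """Largest i <= cap such that counter is divisible by 2**i."""
    i = 0
    while i < cap and counter % (2 ** (i + 1)) == 0:
        i += 1
    return i


def enqueue_write(queue, counters, p, u, data, top_level):
    """Add a WritePartition(p, u, data) job, merging with a queued job on p."""
    lam = max_power_of_two_exponent(counters[p], top_level)
    beta = {u: data}
    if p in queue:                       # merge with the existing job for partition p
        old_lam, old_beta = queue.pop(p)
        lam = max(lam, old_lam)
        beta = {**old_beta, **beta}
    queue[p] = (lam, beta)               # (re)insert at the tail of the FIFO queue
    counters[p] += 1
```

After a merge, the queue holds at most one job per partition, and that job covers the larger of the two level ranges and the union of the blocks to be written.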

Processing jobs from the job queue. For each time step, the reshuffling component performs
$w \log N$ work for a predetermined constant $w$ such that $w \log N$ is greater than or equal to the
amortized cost of the construction. Part of that work may be consumed by a ConcurrentReadPartition
operation executing at the beginning of the time step as described in Figure 12. The remaining
work is performed in the form of jobs obtained from $Q$.

Definition 2 (Processing a job). A job $(p, \lambda, \beta)$ is performed as a ConcurrentWritePartition(p, λ, β)
operation that reshuffles levels $0, \ldots, \lambda$ of partition $p$ and writes the blocks in $\beta$ to partition $p$.

The ConcurrentWritePartition operation is described in Sections 6.1.2 for the practical and recursive
constructions respectively. Additionally, every block read and written to the server is counted to
calculate the amount of work performed as the job is running. A job may be paused after having
completed part of its work.

Jobs are always dequeued from the head of Q. At any point only a single job called the current 
job is being processed unless the queue is empty (then there are no jobs to process). Each job starts 
after the previous job has completed, hence multiple jobs are never processed at the same time. 

If the current job does not consume all of the remaining work of the time step, the next job
in $Q$ becomes the current job, and so on. The current job is paused when the total amount of work
performed in the time step is exactly $w \log N$. In the next time step, the current job is resumed
from where it was paused.
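
The pause-and-resume behavior can be sketched with a per-step work budget. In the Python fragment below, each job is reduced to a name and a remaining amount of work; the function name, the job representation, and the idea of passing the budget left after any concurrent read are assumptions for illustration.

```python
from collections import deque


def run_time_step(jobs, budget):
    """Spend up to `budget` units of work on the FIFO queue `jobs`.

    Each job is a mutable list [name, remaining_work]. Returns the names of jobs
    completed in this step; an unfinished current job stays at the head, paused.
    """
    finished = []
    while budget > 0 and jobs:
        current = jobs[0]
        spent = min(budget, current[1])
        current[1] -= spent
        budget -= spent
        if current[1] == 0:
            finished.append(jobs.popleft()[0])   # next job becomes the current job
    return finished
```

With jobs of 5, 3 and 10 units and a budget of 4 per step, the first step pauses the first job with 1 unit left, the second step finishes the first two jobs, and the third step leaves the last job paused with 6 units remaining.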

We now explain how to perform ConcurrentReadPartition and ConcurrentWritePartition operations
in the practical construction to achieve an $O(\log N)$ worst-case cost with high probability.



6.1.1 Concurrent Reads 

The client performs the ConcurrentReadPartition(p, u) operation by reading 0 or 1 blocks from each
filled level $\ell$ of partition $p$ on the server as follows:

• If level $\ell$ in partition $p$ contains block $u$, then

  – Read block $u$ from level $\ell$ in partition $p$ on the server like in the ReadPartition operation.

• If level $\ell$ in partition $p$ does not contain block $u$ and this level is not being reshuffled, then

  – Read the next dummy block from level $\ell$ in partition $p$ on the server like in the ReadPartition
    operation.

• If level $\ell$ in partition $p$ does not contain block $u$ and this level is being reshuffled, then

  – Recall that when level $\ell$ is being reshuffled, $2^\ell$ previously unread blocks are chosen to be
    read. Let $S$ be the identifiers of that set of blocks for level $\ell$ in partition $p$.

  – Let $S' \subseteq S$ be the IDs of blocks in $S$ that were not read by a previous ConcurrentReadPartition
    operation after the level started being reshuffled. The client keeps track of $S'$ for each
    level by first setting it to $S$ when a ConcurrentWritePartition operation begins and then
    removing $u$ from $S'$ after every ConcurrentReadPartition(p, u) operation.

  – If $S'$ is not empty, the client reads a random block in $S'$ from the server.

  – If $S'$ is empty, then the client doesn't read anything from level $\ell$ in partition $p$. Revealing
    this fact to the server does not affect the security of the construction because the server
    already knows that the client has the entire level stored in its reshuffling buffer.

When $u = \bot$, block $u$ is treated as not being contained in any level. Due to concurrency, it is
possible that a level of a partition needs to be read during reshuffling. In that case, blocks may be
read directly from the client's shuffling buffer containing the level.



6.1.2 Concurrent Writes 

A ConcurrentWritePartition($p$, $\lambda$, $\beta$) operation is performed like the non-concurrent WritePartition
operation described in Figure 7, except for three differences.

The first difference is that the client does not shuffle based on the last consecutively filled level.
Instead it shuffles the levels $0, \ldots, \lambda$, which may include a few more levels than the WritePartition
operation would shuffle.

The second difference is that at Line 9 of Figure 7, the client adds all of the blocks in $\beta$ to the
buffer.

The third difference is at Line 4 of Figure 7. In the non-concurrent construction, the client fetches
the list of the $2^\ell$ block IDs in a level that is about to be reshuffled. It then uses this list to determine
which blocks have already been read (as described in Section 4.5). Because $2^\ell$ is $O(\sqrt{N})$, fetching
this metadata in the non-concurrent construction takes $O(\sqrt{N})$ work in the worst case.

To ensure the worst-case cost of the concurrent construction is $O(\log N)$, the metadata is stored
as a bit array by the client. This bit array indicates which real blocks in that level have already
been read. The client also knows which dummy blocks have been read because it already stores
the nextDummy counter and it can apply the PRP function for all dummy blocks between $k + 1$
and nextDummy, where $k$ is the number of real blocks in a level. Observe that the client only needs
to store a single bit for each real block on the server. Hence this only increases the client storage
by $2N + \epsilon\sqrt{N}$ bits, which is significantly smaller than the size of the index structure that the client
already stores.
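As a rough illustration of the one-bit-per-real-block metadata, a minimal bit array might look like the sketch below. The class name `ReadBitmap` and its methods are hypothetical and are not taken from the paper.

```python
class ReadBitmap:
    """One bit per real block of a level: a set bit means 'already read'."""

    def __init__(self, num_real_blocks):
        self.n = num_real_blocks
        # Round up to whole bytes; all bits start at 0 (unread).
        self.bits = bytearray((num_real_blocks + 7) // 8)

    def mark_read(self, index):
        self.bits[index // 8] |= 1 << (index % 8)

    def is_read(self, index):
        return bool(self.bits[index // 8] & (1 << (index % 8)))

    def unread_indices(self):
        # The blocks that still have to be fetched when the level is reshuffled.
        return [i for i in range(self.n) if not self.is_read(i)]
```

Because the bitmap lives on the client, consulting it costs no server bandwidth, which is the point of replacing the fetched ID list.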

Theorem 4 (Practical concurrent construction). With $1 - \frac{1}{\mathrm{poly}(N)}$ probability, the concurrent
practical construction described above has $O(\log N)$ worst-case and amortized cost, and requires $cN$
client-side storage with a very small $c$, and $O(N)$ server-side storage.

The formal proof of the above theorem is in the full version of the paper [1].



7 Recursive Construction 

The practical (non-concurrent and concurrent) constructions described so far are geared towards
optimal practical performance. However, they are arguably not ideal in terms of asymptotic
performance, since they require a linear fraction of client-side storage for storing a position map of linear
size.

For theoretic interest, we describe in this section how to recursively apply our O-RAM
constructions to store the position map on the server, thereby obtaining O-RAM constructions with
$O(\sqrt{N})$ client-side storage, while incurring only a logarithmic factor in terms of amortized and
worst-case cost.

Figure 13: The recursive construction.

We first describe the recursive, non-concurrent construction which achieves $O((\log N)^2)$
amortized cost, and $O(\sqrt{N})$ worst-case cost. We then describe how to apply the concurrency techniques
to further reduce the worst-case cost to $O((\log N)^2)$, such that the worst-case cost and amortized
cost will be the same.

7.1 Recursive Non-Concurrent Construction 

Intuition. Instead of storing the linearly sized position map locally, the client stores it in a 
separate O-RAM on the server. Furthermore, the O-RAM for the position map is guaranteed to be 
a constant factor smaller than the original O-RAM. In other words, each level of recursion reduces 
the O-RAM capacity by a constant factor. After a logarithmic number of recursions, the size of the 
position map stored on the client is reduced to $O(1)$. The total size of all data caches is $O(\sqrt{N})$,
hence the construction uses $O(\sqrt{N})$ client storage.

For the recursive construction, we employ the Goodrich-Mitzenmacher O-RAM scheme as the
partition O-RAM. Specifically, we employ their O-RAM scheme which, for an O-RAM of capacity
$N$, achieves $O(\log N)$ amortized cost and $O(N)$ worst-case cost, while using $O(\sqrt{N})$ client-side
storage, and $O(N)$ server-side storage.

Definition 3 (O-RAM$_{GM}$). Let O-RAM$_{GM}$ denote the Goodrich-Mitzenmacher O-RAM scheme [6]:
for an O-RAM of capacity $N$, the O-RAM$_{GM}$ scheme achieves $O(\log N)$ amortized cost, and $O(N)$
worst-case cost, while using $O(\sqrt{N})$ client-side storage, and $O(N)$ server-side storage.

Definition 4 (O-RAM$_{base}$). Let O-RAM$_{base}$ denote the O-RAM scheme derived through the
partitioning framework with the following parameterizations: (1) we set $P = \sqrt{N}$ to denote the number
of partitions, where each partition has approximately $\sqrt{N}$ blocks, and (2) we use the O-RAM$_{GM}$
construction as the partition O-RAM.

Notice that in O-RAM$_{base}$, the client requires a data cache of size $O(\sqrt{N})$ and a position map
of size less than $\frac{2N\log N}{B}$ blocks. If we assume that the data block size $B > 2\log N$, then the client
needs to store at most $\frac{2N\log N}{B} + \sqrt{N}\log N = \frac{N}{\alpha} + o(N)$ blocks of data, where the compression
rate $\alpha = \frac{B}{2\log N} > 1$. To reduce the client-side storage, we can recursively apply the O-RAM
construction to store the position map on the server side.

Definition 5 (Recursive construction: O-RAM*). Let O-RAM* denote a recursive O-RAM scheme
constructed as below. In O-RAM$_{base}$, the client needs to store a position map of size $cN$ ($c < 1$).
Now, instead of storing the position map locally on the client, store it in a recursive O-RAM on the
server side. The pseudocode of the O-RAM* scheme would otherwise be the same as that of the
non-recursive scheme, except that the line updating the position map is replaced by the following
recursive O-RAM lookup and update operation:
The position position[$u$] is stored in block $u/\alpha$ of the smaller O-RAM. The client looks up this
block, updates the corresponding entry position[$u$] with the new value $r$, and writes the new block
back. Note that the read and update can be achieved in a single O-RAM operation to the smaller
O-RAM.
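To make the recursion concrete, the sketch below lists which block of each successively smaller position-map O-RAM is touched when block $u$ is accessed. It is an illustration only: the function name, the integer division used for the block index, and the `client_limit` cutoff (the size at which the map is simply kept on the client) are assumptions of this sketch, not details from the paper.

```python
def recursion_path(u, alpha, n_blocks, client_limit):
    """Return (map_size, block_index) pairs, one per level of recursion.

    Each position map packs alpha entries per block, so the entry for index
    `i` lives in block `i // alpha` of a map that is alpha times smaller.
    The last pair is the map small enough to be held by the client.
    """
    path = []
    size, index = n_blocks, u
    while size > client_limit:
        index //= alpha            # block holding the entry for `index`
        size = -(-size // alpha)   # ceiling division: next map is alpha times smaller
        path.append((size, index))
    return path
```

For example, with `alpha = 4`, `n_blocks = 4096` and `client_limit = 64`, accessing block 1000 touches block 250 of a 1024-block map, block 62 of a 256-block map, and entry block 15 of the 64-block map held by the client.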

Theorem 5 (Recursive O-RAM construction). Suppose that the block size $B > 2\log N$, and that
the number of data accesses $M < N^k$ for some $k = O\left(\frac{\sqrt{N}}{\log N}\right)$. Our recursive O-RAM
construction achieves $O((\log N)^2)$ amortized cost, $O(\sqrt{N})$ worst-case cost, and requires $O(N)$ server-side
storage, and $O(\sqrt{N})$ client-side storage.

The proof of the above theorem is presented in the full version of the paper [1].

7.2 Recursive Concurrent Construction 

Using similar concurrency techniques as in Section 6, we can further reduce the worst-case cost
of the recursive construction to $O((\log N)^2)$. Recall that the recursive construction differs from
the practical construction in two ways: (1) it uses the O-RAM$_{GM}$ (Goodrich-Mitzenmacher [6])
scheme as the partition O-RAM and (2) it recurses on its position map. We explain how to
perform concurrent operations in the O-RAM$_{GM}$ scheme to reduce the worst-case cost of the base
construction to $O(\log N)$ with high probability. Then when the recursion is applied, the recursive
construction achieves a worst-case cost of $O((\log N)^2)$ with high probability.

7.2.1 Concurrent Reads 

As concurrency allows reshuffles to be queued for later, it is possible that a level $\ell$ is read more
than $2^\ell$ times in between reshufflings. The O-RAM$_{GM}$ scheme imposes a restriction that at most
$2^\ell$ blocks can be read from a level before it must be reshuffled, by using a set of $2^\ell$ dummy blocks.
We observe that it is possible to perform a dummy read operation instead of using a dummy block
and performing a normal read on it. This essentially eliminates the use of dummy blocks. Note
that the same idea was suggested in the work by Goodrich et al. [9].

A dummy read operation ConcurrentReadPartition($p$, $\bot$) is performed by reading two random
blocks within a level instead of applying a Cuckoo hash on an element from a small domain. Observe
that the Cuckoo hashes for real read operations output uniformly random block positions. Because
the blocks read by a dummy read operation are also chosen from a uniformly random distribution,
dummy reads are indistinguishable from real reads.

This observation allows the client to securely perform a ConcurrentReadPartition($p$, $u$) operation
as follows. For each level (from smaller to larger) of partition $p$, as usual the client performs a
Cuckoo hash of the block identifier $u$ to determine which two blocks to read within the level. Once
the block is found in some level, the client performs dummy reads on subsequent levels. The client
always first checks the local storage to see if the block is in the job queue. If so, then the client
performs dummy reads of all levels.

In summary, instead of reading specific dummy blocks (which can be exhausted since there
are only $2^\ell$ dummy blocks in level $\ell$), the client performs dummy reads by choosing two random
positions in the level.
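The per-level choice between a real Cuckoo lookup and a dummy read can be sketched as below. The function name and the `cuckoo_positions` argument are hypothetical; the two Cuckoo hash functions themselves are not modeled and are assumed to be supplied by the caller.

```python
import random


def positions_to_read(level_size, found_earlier, cuckoo_positions, rng=random):
    """Return the two positions to fetch from one level.

    cuckoo_positions: the pair of positions the level's two hash functions
    give for block u (an input here, since hashing is not modeled).
    found_earlier: True if u was already found in a smaller level or in the
    client's job queue, so that this level only needs a dummy read.
    """
    if found_earlier:
        # Dummy read: two uniformly random positions within the level.
        return (rng.randrange(level_size), rng.randrange(level_size))
    return cuckoo_positions
```

Either way the server sees two positions per level, which is what makes the two cases look alike.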

7.2.2 Concurrent Writes 

In the concurrent recursive construction, the ConcurrentWritePartition($p$, $\lambda$, $\beta$) operation is
performed by first reshuffling levels $0, \ldots, \lambda$ along with the blocks in $\beta$ using the oblivious shuffling
protocol of the O-RAM$_{GM}$ scheme. After the reshuffling has completed, the updated Cuckoo hash
tables have been formed.

Theorem 6 (Recursive concurrent O-RAM). Assume that the block size $B > 2\log N$. With
$1 - \frac{1}{\mathrm{poly}(N)}$ probability, the concurrent recursive construction described above has $O((\log N)^2)$
worst-case and amortized cost, and requires $O(\sqrt{N})$ client-side storage, and $O(N)$ server-side storage.

The formal proof of the above theorem will be included in the full version of the paper [1].

8 Optimizations and Extensions 
8.1 Compressing the Position Map 

The position map is highly compressible under realistic workloads due to the natural sequentiality
of data accesses. Overall we can compress the position map to about 0.255 bytes per block. Hence
its compressed size is $0.255N$ bytes. Even for an extremely large, 1024 TB O-RAM with $N = 2^{32}$
blocks, the position map will be about 1 GB in size. We now explain how to compress the position
map.
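The size estimate is simple arithmetic and can be checked directly. The sketch below assumes the stated figure of 0.255 bytes per block and $N = 2^{32}$ blocks; the function name is illustrative.

```python
def position_map_bytes(num_blocks, bytes_per_block=0.255):
    """Estimated compressed position-map size in bytes."""
    return num_blocks * bytes_per_block


n = 2 ** 32
size = position_map_bytes(n)
size_gib = size / 2 ** 30   # convert bytes to GiB
print(f"{size:.3e} bytes, {size_gib:.2f} GiB")
```

With these inputs the estimate comes to roughly $1.1 \times 10^9$ bytes, i.e. about 1.02 GiB.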

Compressing partition numbers. In [11], Oprea et al. showed that real-world file systems
induce almost entirely sequential access patterns. They used this observation to compress a data
structure that stores a counter of how many times each block has been accessed. Their experimental
results on real-world traces show that the compressed data structure stores about 0.13 bytes per
block. Every time a block is accessed, their data structure stores a unique value (specifically, a
counter) for that block. In our construction, instead of placing newly read blocks in a random cache
slot, we can place them in a pseudo-random cache slot determined by the block ID and counter.
Specifically, when block $i$ is accessed for the $j$-th time (i.e., its counter is $j$), it is placed in cache
slot PRF($i$, $j$). PRF($\cdot$) is a pseudo-random function that outputs a cache slot (or partition) number.
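One possible way to instantiate such a PRF is with HMAC-SHA256, as in the sketch below. The paper does not specify a concrete PRF; the choice of HMAC, the 8-byte encodings, and the function name `prf_slot` are assumptions made for illustration.

```python
import hashlib
import hmac


def prf_slot(key: bytes, block_id: int, counter: int, num_slots: int) -> int:
    """Map (block ID, access counter) to a pseudo-random cache slot number."""
    # Fixed-width encoding keeps distinct (block_id, counter) pairs distinct.
    msg = block_id.to_bytes(8, "big") + counter.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Reduce 64 bits of the digest modulo the slot count; the modulo bias is
    # negligible when num_slots is far below 2**64.
    return int.from_bytes(digest[:8], "big") % num_slots
```

Since the slot is recomputable from the block ID and its counter, only the counter needs to be stored, which is what makes the compression of [11] applicable.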

Compressing level numbers. Recall that each partition contains $L$ levels such that level $\ell$
contains at most $2^\ell$ real blocks. We can represent the level number of each block by using only 1
bit per block on average, regardless of the number of levels. This can be easily shown by computing
the entropy of the level number as follows. If all levels are filled, each block has probability $2^{\ell - L}$
of being in level $\ell$. Then the entropy of the level number is

-x:2 £ - L iog 2 (2^)=x;.-2-<i 

If not all levels in a partition are filled, then the entropy is even less, but for the sake of simplicity
let's still use 1 bit to represent a level number within that partition. Note that since the top level
is slightly larger (it contains $\epsilon$ extra blocks), the entropy might be slightly larger than 1 bit, but
only by a negligible amount.

Using the compressed position map. These two compression tricks allow us to compress our
position map immensely. On average, we can use 0.13 bytes for the partition number and 0.125
bytes (1 bit) for the level number, for a total of 0.255 bytes per block.

Once we have located a block's partition $p$ and level $\ell$, retrieving it is easy. When each level
is constructed, each real block can be assigned a fresh alias PRF($K[p, \ell]$, "real-alias", $u$) where $u$ is
the ID of the block and PRF is a pseudo-random function. Each dummy block can be assigned the
alias PRF($K[p, \ell]$, "dummy-alias", $i$) where $i$ is the index of the dummy block in partition $p$, level
$\ell$. Then during retrieval, the client fetches blocks from the server by their alias.

8.2 Reducing Server Storage 

Each partition's capacity is $\sqrt{N} + \epsilon$ blocks, where the surplus $\epsilon$ is due to the fact that some
partitions may have more blocks than others when the blocks are assigned in a random fashion to
the partitions. A partition has levels $\ell = 0, \ldots, \log_2\sqrt{N}$. Each level contains $2 \cdot 2^\ell$ blocks (real and
dummy blocks), except for the top level that contains $2\epsilon$ additional blocks. Then, the maximum
size of a partition on the server is $4\sqrt{N} + 2\epsilon$ blocks. Therefore, the maximum server storage is
$4N + 2\epsilon\sqrt{N}$.
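The per-partition bound can be checked by summing the level sizes directly. The sketch below assumes $\sqrt{N}$ is a power of two; the function name and the sample value of the surplus are illustrative.

```python
import math


def max_partition_blocks(n_blocks, epsilon):
    """Sum the server-side sizes of all levels of one partition."""
    sqrt_n = math.isqrt(n_blocks)
    top_level = sqrt_n.bit_length() - 1          # log2(sqrt(N)) for powers of two
    # Each level l = 0..top_level stores 2 * 2**l blocks (real plus dummy).
    total = sum(2 * 2 ** level for level in range(top_level + 1))
    return total + 2 * epsilon                   # top level has 2*epsilon extra blocks


n = 2 ** 20
blocks = max_partition_blocks(n, epsilon=10)
bound = 4 * math.isqrt(n) + 2 * 10               # the bound 4*sqrt(N) + 2*epsilon
```

The geometric sum gives $4\sqrt{N} - 2 + 2\epsilon$, just under the stated bound; multiplying by the $\sqrt{N}$ partitions yields the total server storage.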

However, the amount of server storage actually required is less than that, for two
reasons:

1. Levels of partitions are sometimes not filled. It is extremely unlikely that at some point in 
time, all levels of all partitions are simultaneously filled. 

2. As soon as a block is read from a level, it can be deleted by the server because its value is no 
longer needed. 

In our simulation experiments, we calculated that the server never needs to store more than
$3.2N$ blocks at any point in time. Hence, in practice, the server storage can be regarded as being
less than $3.2N$ blocks.



8.3 Compressing Levels During Uploading 

In the WritePartition algorithm in Figure 7, the client needs to write back up to $2\sqrt{N} + o(\sqrt{N})$
blocks to the server, at least half of which are dummy blocks. Since the values of the dummy
blocks are irrelevant (the server cannot differentiate between real and dummy blocks), it is
possible to use a matrix compression algorithm to save a 2X factor in terms of bandwidth.

Suppose that the client wishes to transfer $2k$ blocks $b := (b_1, b_2, \ldots, b_{2k})$. Let $S \subseteq \{1, 2, \ldots, 2k\}$
denote the offsets of the real blocks, and let $b_S$ denote the vector of real blocks. We consider the case
when exactly $k$ of the blocks are real, i.e., $b_S$ is of length $k$ (if fewer than $k$ blocks are real, simply
select some dummy blocks to fill in). To achieve the 2X compression, the server and client share a
$2k \times k$ Vandermonde matrix $M_{2k \times k}$ during an initial setup phase. Now to transfer the blocks $b$, the client
solves the linear equation:

$$M_S \cdot x = b_S$$

where $M_S$ denotes the matrix formed by rows of $M$ indexed by the set $S$, and $b_S$ denotes the vector
$b$ indexed by the set $S$, i.e., the list of real blocks.

The client can simply transfer $x$ (length $k$) to the server in place of $b$ (length $2k$). The server
decompresses it by computing $y \leftarrow Mx$, and it is not hard to see that $y_S = b_S$. The server is unable
to distinguish which blocks are real and which ones are dummy, since the Vandermonde matrix
ensures that any subset of $k$ values of $y$ are a predetermined linear combination of the remaining $k$
values of $y$.
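A toy version of this compression can be written over a prime field, treating each block as a single field element. The modulus, the evaluation points $1, \ldots, 2k$, the 0-based offsets, and all function names below are assumptions of this sketch; a real deployment would apply the same linear map to every word of each encrypted block.

```python
P = 2_147_483_647  # prime modulus 2**31 - 1 (an assumption of this sketch)


def vandermonde(rows, cols):
    # Row i is (x^0, x^1, ..., x^(cols-1)) for the distinct point x = i + 1,
    # so any `cols` rows form an invertible square Vandermonde matrix.
    return [[pow(i + 1, j, P) for j in range(cols)] for i in range(rows)]


def solve_mod(a, b):
    """Solve a . x = b over GF(P) by Gauss-Jordan elimination (a invertible)."""
    n = len(a)
    m = [row[:] + [rhs % P] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] % P)
        m[col], m[pivot] = m[pivot], m[col]
        inv = pow(m[col][col], -1, P)            # modular inverse (Python 3.8+)
        m[col] = [v * inv % P for v in m[col]]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col]
                m[r] = [(v - f * w) % P for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


def compress(real_offsets, real_blocks, matrix):
    # Client side: solve M_S . x = b_S for the k unknowns x.
    return solve_mod([matrix[i] for i in real_offsets], real_blocks)


def decompress(x, matrix):
    # Server side: y = M . x has 2k entries and agrees with b on the offsets S.
    return [sum(c * v for c, v in zip(row, x)) % P for row in matrix]
```

For instance, with $k = 3$, real offsets `[0, 2, 5]` and real blocks `[11, 22, 33]`, the client sends three values and the server expands them to six, with the real blocks reappearing at offsets 0, 2 and 5.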
Appendices



A Bounding Partition Capacity and Cache Capacity 
A.1 Bounding the Partition Capacity

We now bound the partition capacity. We will think of the i-th cache slot of the client as an 
extension of the i-th partition on the server. We will bound the maximum number of data blocks 
in the partition plus corresponding cache slot. This is obviously an upper-bound on the capacity 
of the corresponding partition. 

We assume that the data access sequence is independent of the random coins used in the O-RAM 
scheme. Notice that this is also the case in practical applications. 

Suppose we are given a data access sequence: 

$$x := (x_1, x_2, \ldots, x_M).$$

If we think of the partition and its corresponding cache slot as a single unit, the operation of the
O-RAM is equivalent to the following random process: Initially, each of the $N$ blocks is assigned to an
independent random partition. In each time step, pick an arbitrary element from its corresponding
partition, and place it in a fresh random partition.

Henceforth, when we perform probabilistic analysis on this random process, we assume that the
probability space is defined over the initial coin flips that place each block into a partition, as well
as the coin flips in each time step $t$ that place $x_t$ into a random partition.

Fact 1. At any point of time, the distribution of blocks in the partitions (and their extended cache
slots) is the same as throwing $N$ balls into $\sqrt{N}$ bins.

Fact 2. If we throw s = N balls randomly into t 
Pi[a specific bin > k balls] < 

Therefore, 




/N+k 



Pr[a specific bin > vJV + k balls] < —= =1 ■= < 



N + k V (VN + k)/k 



A \ ' ' / I \ 1 



e fc 



Theorem 7. Let $k > 0$. At any time $t \in \mathbb{N}$, with probability $1 - o(1)$, the number of blocks in the
$i$-th partition $Z_i$ is bounded by $\sqrt{N} + k\ln N$, i.e.,

Pr[Z, > VN + klnN] < ± 



Proof, (sketch.) Straightforward from the above calculation. 



□ 



Theorem 8 (Partition capacity). Let the total number of data requests $M < N^k$ for some positive
$k$. Given any sequence $x$ of length $M$, let $Z_{i,t}$ denote the load of partition $i$ at time step $t$.
Then,

$$\Pr\left[\exists i \in [N], t \in [M] : Z_{i,t} > \sqrt{N} + (k + c)\ln N\right] \le \frac{1}{N^c}$$

Figure 14: Discrete-time Markov Chain for each cache slot. (From state $i$, the chain moves to state $i+1$ with probability $p(1-q)$ and back with probability $q(1-p)$.)





Proof (sketch). Apply Theorem 7 and take a union bound over all partitions and all $M$ time
steps. □



Given Theorem 8, it suffices to set each partition's capacity to be $\sqrt{N} + o(\sqrt{N})$. In other words,
each partition is an O-RAM that can store up to $\sqrt{N} + o(\sqrt{N})$ blocks.



A.2 Bounding the Client Data Cache Size

In the following analysis, we assume the Random Evict algorithm with an eviction rate $\nu = 2$.

Recall that in our partitioning framework, eviction of blocks from the data cache can happen 
in two ways: 1) piggy-backed evictions that happen together with regular read or write operations; 
and 2) background eviction. For the sake of upper-bounding the client data cache, we pretend that 
there are no piggy-backed evictions, and only background evictions. This obviously gives an upper 
bound on the data cache size. 

Now focus on a single slot. A slot can be viewed as a discrete-time Markov Chain as shown in
Figure 14. Basically, in every time step, with probability $p = \frac{1}{\sqrt{N}}$, a block is added to the slot, and
with probability $q = \frac{2}{\sqrt{N}}$, a block is evicted (if one exists).

Let $\rho = \frac{p(1-q)}{q(1-p)} < \frac{1}{2}$. Let $\pi_i$ denote the probability that a slot has $i$ blocks in the stationary
distribution. Due to standard Markov Chain analysis, we get:

$$\pi_i = \rho^i (1 - \rho)$$

Fact 3. At any given point of time, the expected number of blocks in each slot is $\frac{\rho}{1-\rho}$, and the
expected number of blocks in all data cache slots is $\frac{P\rho}{1-\rho}$, where $P$ is the number of partitions (or
slots).
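The expected occupancy can be evaluated numerically. The sketch below makes several assumptions that should be read as such: $P = \sqrt{N}$ slots, an arrival probability $p = 1/P$, an eviction probability $q = \nu/P$ with $\nu = 2$, and the ratio $\rho = p(1-q)/(q(1-p))$ taken from the transition labels of Figure 14, with $\rho/(1-\rho)$ as the mean of the resulting geometric-type stationary distribution. The function name is illustrative.

```python
import math


def slot_statistics(n_blocks, eviction_rate=2):
    """Return (rho, expected blocks per slot, expected blocks in all slots)."""
    num_slots = math.isqrt(n_blocks)     # P = sqrt(N) slots
    p = 1 / num_slots                    # a new block lands in this slot
    q = eviction_rate / num_slots        # this slot is picked for eviction
    rho = p * (1 - q) / (q * (1 - p))    # ratio of forward to backward transitions
    per_slot = rho / (1 - rho)
    return rho, per_slot, num_slots * per_slot


rho, per_slot, total = slot_statistics(2 ** 20)
```

For $N = 2^{20}$ this gives $\rho$ just under $1/2$ and an expected total of about 1022 cached blocks, slightly below $\sqrt{N} = 1024$.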

Definition 6 (Negative association [3]). A set of random variables $X_1, X_2, \ldots, X_k$ are negatively
associated, if for every two disjoint index sets $I, J \subseteq [k]$,

$$E[f(X_i, i \in I)\, g(X_j, j \in J)] \le E[f(X_i, i \in I)]\, E[g(X_j, j \in J)]$$

for all functions $f : \mathbb{R}^{|I|} \to \mathbb{R}$ and $g : \mathbb{R}^{|J|} \to \mathbb{R}$ that are both non-decreasing or both non-increasing.

Proposition 1. At any given point of time, let $Z_i$ denote the number of blocks in slot $i \in [P]$.
Then, the random variables $\{Z_i\}_{i \in [P]}$ are negatively associated.

Proof. Let $B_{i,j}$ ($i \in [P], j \in [M]$) be indicator random variables defined as below:

$$B_{i,j} := \begin{cases} 1 & \text{if a block is placed in slot } i \text{ in the } j\text{-th time step} \\ 0 & \text{otherwise} \end{cases}$$

In each time step, $\nu$ slots will be randomly chosen for eviction. Let $C_{i,j,k}$ ($i \in [P], j \in [M], k \in [\nu]$)
be the indicator random variables defined as below:

$$C_{i,j,k} := \begin{cases} 0 & \text{if slot } i \text{ is selected for background eviction in time step } j \text{, } k\text{-th eviction} \\ 1 & \text{otherwise} \end{cases}$$

Notice that in the above, the indicator $C_{i,j,k}$ is 0 (rather than 1) if slot $i$ is selected for background
eviction in time step $j$. This is an intentional choice which will later ensure that the number of
blocks in a slot is a non-decreasing function of these indicator random variables.

Due to Proposition 11 of [3], the vector $\{B_{i,j}\}_{i \in [P], j \in [M]}$ is negatively associated.

Similarly, due to the Zero-One lemma of [3], the vector $\{1 - C_{i,j,k}\}_{i \in [P], j \in [M], k \in [\nu]}$ is negatively
associated, and hence, the vector $\{C_{i,j,k}\}_{i \in [P], j \in [M], k \in [\nu]}$ is negatively associated.

Now, since all $B_{i,j}$'s and $C_{i,j,k}$'s are mutually independent, the set of variables $\{B_{i,j}\}_{i \in [P], j \in [M]} \cup
\{C_{i,j,k}\}_{i \in [P], j \in [M], k \in [\nu]}$ is negatively associated due to Proposition 7 of [3].

For any $M \in \mathbb{N}$, after $M$ time steps, the number of blocks $Z_i$ in slot $i$, where $i \in [P]$, is a
non-decreasing function $Z_i := Z_i(\{B_{i,j}\}_{j \in [M]}, \{C_{i,j,k}\}_{j \in [M], k \in [\nu]})$. Due to Proposition 7 of [3], the
number of blocks in each cache slot, i.e., the vector $\{Z_i\}_{i \in [P]}$, is negatively associated. □



Lemma 2 (Tail bound for sum of geometric random variables). Let $X_1, X_2, \ldots, X_k$ be independent
geometric random variables, each having parameter $p$, and mean $\frac{1}{p}$. Let $X := \sum_{i=1}^{k} X_i$, and let $\mu = \frac{k}{p}$.
Then, for $0 < \epsilon < 1$, we have

$$\Pr[X \ge (1 + \epsilon)\mu] \le \exp\left(-\frac{\epsilon^2 k}{4}\right)$$

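Proof. Use the method of moment generating functions. Suppose $e^t < \frac{1}{1-p}$.

$$E[\exp(tX_i)] = pe^t + p(1-p)e^{2t} + p(1-p)^2e^{3t} + \ldots = \frac{p}{1-p}\sum_{j=1}^{\infty}\left(e^t(1-p)\right)^j = \frac{p}{1-p}\cdot\frac{e^t(1-p)}{1-e^t(1-p)} = \frac{pe^t}{1-e^t(1-p)}$$

$$E[\exp(tX)] = \prod_{i=1}^{k} E[\exp(tX_i)] = \left(\frac{pe^t}{1-e^t(1-p)}\right)^{\mu p}$$

Therefore,

$$\Pr[X \ge (1+\epsilon)\mu] = \Pr[\exp(tX) \ge \exp(t(1+\epsilon)\mu)] \le \frac{E[\exp(tX)]}{\exp(t(1+\epsilon)\mu)} = \left(\frac{p}{1-e^t(1-p)}\cdot\frac{1}{e^{t\left(\frac{1+\epsilon}{p}-1\right)}}\right)^{\mu p}$$

The above inequality holds for all $t$ such that $e^t < \frac{1}{1-p}$. We now pick the $t$ that minimizes the bound on
$\Pr[X \ge (1+\epsilon)\mu]$. It is not hard to see that the above is minimized when $e^t = \frac{1+\epsilon-p}{(1+\epsilon)(1-p)}$. Therefore, plugging
$e^t = \frac{1+\epsilon-p}{(1+\epsilon)(1-p)}$ back in,

$$\Pr[X \ge (1+\epsilon)\mu] \le \left((1+\epsilon)\left(\frac{(1+\epsilon)(1-p)}{1+\epsilon-p}\right)^{\frac{1+\epsilon}{p}-1}\right)^{\mu p} = \left((1+\epsilon)\left(1-\frac{\epsilon p}{1+\epsilon-p}\right)^{\frac{1+\epsilon-p}{p}}\right)^{\mu p} \le \left((1+\epsilon)e^{-\epsilon}\right)^{\mu p}$$

Specifically, for $0 < \epsilon < 1$, $(1+\epsilon)e^{-\epsilon} \le \exp\left(-\frac{\epsilon^2}{4}\right)$. Therefore, for $0 < \epsilon < 1$, we have:

$$\Pr[X \ge (1+\epsilon)\mu] \le \exp\left(-\frac{\epsilon^2 \mu p}{4}\right) = \exp\left(-\frac{\epsilon^2 k}{4}\right)$$

□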

Lemma 3 (Tail bound for sum of negatively-associated geometric random variables). Let $X_1, X_2, \ldots, X_k$
be negatively associated geometric random variables, each having parameter $p$, and mean $\frac{1}{p}$. Let
$X := \sum_{i=1}^{k} X_i$, and let $\mu = \frac{k}{p}$. Then, for $0 < \epsilon < 1$, we have

$$\Pr[X \ge (1 + \epsilon)\mu] \le \exp\left(-\frac{\epsilon^2 k}{4}\right)$$
Proof. Similar to the proof of Lemma 2. Observe also that for negatively associated random
variables $X_1, X_2, \ldots, X_k$, for $t > 0$, we have

$$E[\exp(tX)] = E\left[\prod_{i=1}^{k}\exp(tX_i)\right] \le \prod_{i=1}^{k} E[\exp(tX_i)]$$

The remainder of the proof follows directly from the proof of Lemma 2. □

Fact 4. At any given point of time, let $Z_i$ denote the number of blocks in slot $i \in [P]$, where $P$
is the number of partitions (or slots). Then, $Y_i := Z_i + 1$ is a geometrically distributed random
variable with mean $\frac{1}{1-\rho}$. Moreover, as the random variables $\{Z_i\}_{i \in [P]}$ are negatively associated, so
are $\{Y_i\}_{i \in [P]}$.

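Lemma 4. Let $Z$ denote the total cache size at any point of time after the chain has reached
stationary distribution. Let $P = \sqrt{N}$ denote the number of partitions (or slots). From Fact 3,
$E[Z] = \frac{P\rho}{1-\rho} \le \sqrt{N}$. We have the following tail bound:

$$\Pr\left[Z \ge \sqrt{N} + 4cN^{1/4}\sqrt{\ln N}\right] \le \frac{1}{N^{c^2}}$$

Proof. Let $Y = \sum_{i=1}^{P} Y_i = \sum_{i=1}^{P}(Z_i + 1)$ denote the sum of negatively associated geometric random
variables $Y_i$, with mean $\frac{1}{1-\rho}$. Therefore $E[Y] = \frac{\sqrt{N}}{1-\rho} \le 2\sqrt{N}$, and

$$\Pr\left[Z \ge E[Z] + 4cN^{1/4}\sqrt{\ln N}\right] = \Pr\left[Y \ge E[Y] + 4cN^{1/4}\sqrt{\ln N}\right]$$

Let

$$\epsilon := \frac{4cN^{1/4}\sqrt{\ln N}}{E[Y]} \ge \frac{2c\sqrt{\ln N}}{N^{1/4}}$$

Notice that $0 < \epsilon < 1$ for sufficiently large $N$. Therefore, by Lemma 3,

$$\Pr\left[Z \ge E[Z] + 4cN^{1/4}\sqrt{\ln N}\right] = \Pr\left[Y \ge (1+\epsilon)E[Y]\right] \le \exp\left(-\frac{\epsilon^2\sqrt{N}}{4}\right) \le \exp\left(-\frac{4c^2\ln N}{4}\right) = \frac{1}{N^{c^2}}$$

□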



Theorem 9 (Data cache capacity). Let the total number of data requests $M < N^k$ for some $k > 0$.
Let $Z(0, t)$ denote the total number of blocks in the client data cache at time $t \in [M]$, assuming that
the system initially starts in the state 0, i.e., all slots are empty initially. Then, for $c > 0$,

$$\Pr\left[\max_{t \in [M]} Z(0, t) \ge \sqrt{N} + 4\sqrt{k + c}\cdot N^{1/4}\sqrt{\ln N}\right] \le \frac{1}{N^c}$$
Proof (sketch). Assume that the mixing time of the Markov Chain for each slot is $T$.

We show that with high probability, the data cache size never exceeds $\sqrt{N} + o(\sqrt{N})$ between
time $T$ and $T + M$. To show this, simply use Lemma 4 and take a union bound over time steps
$T$ through $T + M$:

$$\Pr\left[\max_{t \in [T, T+M]} Z(0, t) \ge \sqrt{N} + 4\sqrt{k + c}\cdot N^{1/4}\sqrt{\ln N}\right] \le \frac{1}{N^c}$$

Furthermore, notice that the cache size is less likely to exceed a certain upper bound when each
slot starts empty than when starting in any other initial state. One can formally prove this by showing
that for every sample path (i.e., sequence of coin flips), starting in the empty state never results in
more blocks in the client's data cache than starting in any other state.
Therefore,

$$\Pr\left[\max_{t \in [M]} Z(0, t) \ge \sqrt{N} + 4\sqrt{k + c}\cdot N^{1/4}\sqrt{\ln N}\right] \le \Pr\left[\max_{t \in [T, T+M]} Z(0, t) \ge \sqrt{N} + 4\sqrt{k + c}\cdot N^{1/4}\sqrt{\ln N}\right] \le \frac{1}{N^c}$$

□



B Recursive Construction Costs 

Proof of Theorem 5. The recursive O-RAM construction is obtained by recursively applying the
O-RAM* construction $O(\log N)$ times to the position map.

• The amortized cost of the O-RAM* construction is $O(\log N)$. The recurrence equation for
the amortized cost is:

$$T(N) = T(N/\alpha) + O(\log N)$$

which solves to $T(N) = O((\log N)^2)$.

• The worst-case cost of the O-RAM* construction is $O(\sqrt{N})$. The recurrence relation of the
worst-case cost is

$$T(N) = T(N/\alpha) + O(\sqrt{N})$$

which solves to $T(N) = O(\sqrt{N})$.

• With high probability, each partition's capacity in the O-RAM* construction will not exceed
$O(\sqrt{N})$ (Theorem 8), and hence the total server-side storage of the O-RAM* construction is
$O(N)$. The recurrence relation for the server-side storage is

$$T(N) = T(N/\alpha) + O(N)$$

which solves to $T(N) = O(N)$.

• With high probability, the client's data cache will not exceed $O(\sqrt{N})$ (Theorem 9) in the
O-RAM* construction. The recurrence relation for the client-side storage is

$$T(N) = T(N/\alpha) + O(\sqrt{N})$$

which solves to $T(N) = O(\sqrt{N})$.

Notice that the O-RAM construction by Goodrich et al. [6], which we use for the partition
O-RAM, requires a shuffling buffer of size $O(\sqrt{N})$ for an O-RAM of capacity $N$. But this
reshuffling buffer can be shared across all partitions at all levels of recursion. Therefore, it
does not increase the client-side storage asymptotically.

□ 
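For completeness, the recurrences of the form $T(N) = T(N/\alpha) + f(N)$ can be unrolled as follows, assuming a constant $\alpha > 1$ and a constant-size base case (both assumptions of this worked step):

```latex
\begin{align*}
\text{logarithmic term:}\quad
T(N) &= \sum_{i=0}^{\log_\alpha N} O\!\left(\log\frac{N}{\alpha^i}\right)
      \le (\log_\alpha N + 1)\cdot O(\log N) = O\!\left((\log N)^2\right), \\
\text{square-root term:}\quad
T(N) &= \sum_{i=0}^{\log_\alpha N} O\!\left(\sqrt{N/\alpha^i}\right)
      \le O(\sqrt{N}) \sum_{i=0}^{\infty} \alpha^{-i/2}
      = O\!\left(\frac{\sqrt{N}}{1-\alpha^{-1/2}}\right) = O(\sqrt{N}), \\
\text{linear term:}\quad
T(N) &= \sum_{i=0}^{\log_\alpha N} O\!\left(N/\alpha^i\right)
      \le O(N) \sum_{i=0}^{\infty} \alpha^{-i}
      = O\!\left(\frac{N}{1-\alpha^{-1}}\right) = O(N).
\end{align*}
```

The logarithmic term gains one extra logarithmic factor because every level of recursion contributes up to $O(\log N)$, whereas the other two sums are geometric and are dominated by their first term.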



C Security of the Practical Partition O-RAM 

Here we prove the security of the partition O-RAM construction of our practical scheme. 

Lemma 5. The ReadPartition operation will never read the same block from a level more than once 
before that level is reshuffled. 

Proof. A block is read from a level for one of two reasons. It is either a real block that the client 
wants to read from the partition, or it is a dummy block. If it is a real block, then the real block 
will be placed in the client's cache and the next time that same real block is being read, it will be
read from its new location. If the block is a dummy block, then after reading it, the nextDummy 
counter is incremented, ensuring that another dummy block will be read the next time a dummy 
must be read from that level. When a level is reshuffled, it either becomes unfilled or it is filled 
with a new set of blocks. □ 

Theorem 10. The ReadPartition operation causes blocks to be read from the server pseudo-randomly 
without replacement, hence independently from the data request sequence. 

Proof. The ReadPartition operation reads a block from each level of the partition. Since the blocks
in a level are pseudo-randomly permuted by applying the PRP function, each block read is
pseudo-randomly chosen. By Lemma 5, the block read from each level will be a previously unread block,
so the blocks are read without replacement. □

Lemma 6. A reshuffling of a set of levels leaks no other information to the server except that a 
reshuffling of those levels occurred. 

Proof. A reshuffling operation always reads exactly $2^\ell$ unread blocks from a level $\ell$ on the server,
shuffles them on the client side, and uploads them to the first unfilled level (or the top level if all
levels are filled). Since the server always knows which levels are filled and which blocks have been read,
and cannot observe the shuffling that happens entirely on the client side, the server learns
nothing new. □

Theorem 11. The WritePartition operations leak no information about the data request sequence. 

Proof. After exactly every $2^\ell$ WritePartition operations, levels $0, 1, \ldots, \ell$ are reshuffled. Therefore
the reshufflings happen at regular deterministic intervals. The reshufflings are the only operations
caused by WritePartition that are observed by the server. By Lemma 6, the reshuffling operations
leak no information to the server except that the reshufflings happened. Since the reshufflings
happen at deterministic intervals independent of the data request sequence, the WritePartition
operations leak no information about the data request sequence. □
D Concurrent Constructions: Proof of Worst-Case Cost



Consider level $i$ ($1 \le i \le \lfloor\log_\alpha N\rfloor$) of the recursion, let $S$ be the maximum capacity for each
partition, and let $P$ denote the number of partitions. For the sake of asymptotic proofs, we focus
on $1 \le i \le \frac{1}{2}\lfloor\log_\alpha N\rfloor - 1$, since the O-RAM capacity at level $\frac{1}{2}\lfloor\log_\alpha N\rfloor$ (or lower) is bounded by
$O(\sqrt{N})$. Therefore, we assume that the recursion stops at level $\frac{1}{2}\lfloor\log_\alpha N\rfloor - 1$, and that the
client simply stores level $\frac{1}{2}\lfloor\log_\alpha N\rfloor$ locally.

D.1 Distribution of Amount of Work Queued for One Partition

We now analyze the distribution of the amount of work queued for one specific partition at any
point of time, after the system reaches steady state. Imagine that the amortizer schedules work
fast enough, such that all jobs will be dequeued in at most $\tau$ time steps, where $\tau$ is a parameter to be
determined later.

Consider the following stochastic process: Let $C_p \in \mathbb{Z}_S$ denote a counter for partition $p$, where
$S$ is the maximum capacity of each partition. At every time step, flip an independent random coin
$r \leftarrow$ UniformRandom($1..P$), and let $C_r \leftarrow C_r + 1 \pmod{S}$.

We shall first focus our analysis on a single given partition. The stochastic process for a given
partition can be described as below. Basically, let $C$ denote the counter for the given partition.
At every time step, we flip an independent random coin which comes up heads with probability $\frac{1}{P}$.
We let $C \leftarrow C + 1 \pmod{S}$ if the coin comes up heads.

Whenever the counter $C$ reaches a multiple of $2^i$ for some positive integer $i$ (but $C$ is not a
multiple of $2^{i+1}$), a shuffling job of size $2^i$ is created and enqueued. Furthermore, whenever a
job of size $2^i$ is created and enqueued, all existing jobs of size $\le 2^i$ are immediately cancelled and
removed from the queue. All jobs will be dequeued after at most $\tau$ time steps.

Fact 5. Consider a given partition. Suppose that jobs in the queue are ordered according to the
time they are enqueued. Then the sizes of jobs in the queue are strictly decreasing.

Fact 6. At any point of time, for a given partition, suppose that the largest job in the queue has
size $2^i$, then the total size of all jobs in the queue is bounded by $2 \cdot 2^i$.

Henceforth, we refer to the total size of all unfinished jobs as the amount of work for a given 
partition. 
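The job-creation rule and the two facts above can be exercised with a small simulation of one partition's counter. This is a simplified sketch: it advances the counter on every call (the random coin is left to the caller), ignores the dequeue deadline, and treats a counter wrap to 0 as creating no job, all of which are assumptions of this illustration. The function name is hypothetical.

```python
def step_counter(counter, queue, capacity):
    """Advance one partition's counter and update its queue of job sizes.

    `queue` lists pending job sizes in enqueue order (oldest first).
    """
    counter = (counter + 1) % capacity
    if counter != 0 and counter % 2 == 0:
        # Largest power of two dividing the counter: the new job's size 2**i.
        size = counter & -counter
        # Cancel every pending job that is no larger than the new one.
        queue = [s for s in queue if s > size]
        queue.append(size)
    return counter, queue


counter, queue = 0, []
for _ in range(1000):
    counter, queue = step_counter(counter, queue, 1024)
```

After 1000 increments the queue holds the sizes 512, 256, 128, 64, 32 and 8: strictly decreasing in enqueue order, with a total below twice the largest job.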

For a given partition, the above stochastic process can be modelled by a Markov Chain. Each
state is denoted by the tuple $(c, Q)$ where $c$ is the counter value, and $Q$ denotes the current state
of the queue. Specifically, let $Q := \{(s_i, t_i)\}_{i \in [m]}$, where $m \le \lceil\log S\rceil$ denotes the current queue length.
The sequence of $s_i$'s represents the job sizes and is strictly decreasing according to Fact 5. Each
$0 \le t_i < \tau$ represents the time elapsed since the $i$-th job was enqueued. The transitions of the
Markov Chain are defined as below:

Let $(c, Q)$ denote the current state, where $Q := \{(s_i, t_i)\}_{i \in [m]}$.

• If $c + 1 \pmod{S}$ is a multiple of $2^i$ for some positive integer $i$, let $i$ denote the largest
positive integer such that $c + 1$ is a multiple of $2^i$:

- With probability $\frac{1}{P}$, transition to state $(c + 1 \pmod{S}, Q')$ where $Q' := \{(2^i, 0)\} \cup \{(s, t+1) \mid (s, t) \in Q \text{ and } t < \tau - 1 \text{ and } s > 2^i\}$.

- With probability $1 - \frac{1}{P}$, transition to state $(c, Q')$ where $Q' := \{(s, t+1) \mid (s, t) \in Q \text{ and } t < \tau - 1\}$.

• If $c + 1 \pmod{S}$ is not a multiple of $2^i$ for any positive integer $i$:

- With probability $\frac{1}{P}$, transition to state $(c + 1 \pmod{S}, Q')$ where $Q' := \{(s, t+1) \mid (s, t) \in Q \text{ and } t < \tau - 1\}$.

- With probability $1 - \frac{1}{P}$, transition to state $(c, Q')$ where $Q' := \{(s, t+1) \mid (s, t) \in Q \text{ and } t < \tau - 1\}$.

It is also not hard to see that the Markov Chain is ergodic, and therefore has a stationary 
distribution. 

Let $\pi_i$ denote the probability that the largest job in a given partition has size $2^i$. According to 
Fact 6, the total amount of work for that partition is then bounded by $2 \cdot 2^i$. Since the Markov 
Chain has a stationary distribution, $\pi_i$ is both the time average and ensemble average in its steady 
state. 

Lemma 7. The probability $\pi_i$ that the largest job in a given partition has size $2^i$ satisfies 
$\pi_i \le \frac{4\tau}{P \cdot 2^i} = O\left(\frac{\tau}{P \cdot 2^i}\right)$. 

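Proof. We use the time average to obtain an upper bound on $\pi_i$. Suppose we let the Markov Chain 
run for a very long time $T$. Let $X_i$ denote the indicator random variable that the Markov Chain 
goes from $(c, *)$ to $(c + 1, *)$ during time step $i \in [T]$. 
Let $X(T) := \sum_{i \in [T]} X_i$. It is not hard to see that 

$$E[X(T)] = \frac{T}{P}$$

Due to the central limit theorem, as $T \to \infty$, $\Pr[X(T) \le 2E[X(T)]] \to 1$. 

If the counter has advanced $X$ times, then the counter has completed at most $\lceil X/S \rceil$ 
cycles. In each cycle, there are at most $\frac{S}{2^i} \cdot \tau$ time steps in which the largest job has size $2^i$. 
Therefore, over $\lceil X/S \rceil$ cycles, there are at most $\lceil X/S \rceil \cdot \frac{S}{2^i} \cdot \tau \le \frac{\tau}{2^i} \cdot 2X$ time steps in which 
the largest job has size $2^i$, where the last step uses $\lceil X/S \rceil \le 2X/S$, which holds once $X \ge S$. 

Therefore, 

$$\forall\, 1 \le i \le \lceil \log S \rceil: \quad \pi_i \le \lim_{T \to \infty} \frac{1}{T} \cdot \frac{\tau}{2^i} \cdot 2X(T) \le \lim_{T \to \infty} \frac{1}{T} \cdot \frac{\tau}{2^i} \cdot 2 \cdot 2E[X(T)] = \frac{4\tau}{P \cdot 2^i}$$

□ 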

Lemma 8. In steady state, the expected size of the largest job for a given partition is upper bounded 
by $\frac{4\tau}{P} \log S$. Moreover, the expected amount of work for a given partition is upper bounded by 
$\frac{8\tau}{P} \log S$. 

Proof. Straightforward from Lemma 7 and Fact 6. □ 
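
The step from Lemma 7 to Lemma 8 can be spelled out as follows. This is a sketch that assumes Lemma 7 supplies a bound of the form $\pi_i \le \frac{c\,\tau}{P \cdot 2^i}$, where $c$ is a placeholder constant introduced here only for illustration.

```latex
\begin{align*}
E[\text{largest job size}]
  &= \sum_{i=1}^{\lceil \log S \rceil} 2^i \cdot \pi_i
   \le \sum_{i=1}^{\lceil \log S \rceil} 2^i \cdot \frac{c\,\tau}{P \cdot 2^i}
   = \frac{c\,\tau}{P} \lceil \log S \rceil, \\
E[\text{amount of work}]
  &\le 2 \cdot E[\text{largest job size}]
   \le \frac{2c\,\tau}{P} \lceil \log S \rceil
   \qquad \text{(by Fact 6).}
\end{align*}
```

The factor $2^i$ cancels in every term, which is why the expected job size grows only logarithmically in $S$.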



D.2 Bounding Total Amount of Work Queued for All Partitions 

Lemma 9 (Measure concentration for independent random variables). Let $X_1, X_2, \ldots, X_P$ denote 
i.i.d. random variables. Let $\lambda \in \mathbb{N}$, and suppose that each $X_j$ ($j \in [P]$) has non-zero probability only for 
integer values within $[0, 2^\lambda]$. Furthermore, the distribution of each $X_j$ satisfies the 
following. For each integer $1 \le i \le \lambda$: $\Pr[X_j = 2^i] \le \frac{\beta}{2^i}$ for some $\beta > 0$. Let $X = \sum_{j=1}^{P} X_j$. It is 
not hard to see that $E[X] = P \cdot E[X_1] \le \beta\lambda P$. Furthermore, 

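$$\Pr\left[X > \beta\lambda P + 2 \cdot 2^\lambda \cdot \log\frac{1}{\delta}\right] < \delta$$

Proof. For $j \in [P]$ and $i \in [\lambda]$, let $Y_{j,i}$ denote the following random variable: 

$$Y_{j,i} = \begin{cases} 1 & \text{if } X_j = 2^i \\ 0 & \text{otherwise} \end{cases}$$

Let $Y_i = 2^i \cdot \sum_{j=1}^{P} Y_{j,i}$. It is not hard to see that $E[Y_i] = 2^i \cdot \sum_{j=1}^{P} E[Y_{j,i}] \le \beta P$. Let $q = \frac{\beta}{2^i}$. For 
$\gamma \in [P]$, we have: 

$$\Pr[Y_i \ge \gamma \cdot 2^i] \le \binom{P}{\gamma} q^\gamma \le \left(\frac{eqP}{\gamma}\right)^\gamma$$

where the last inequality above is due to Stirling's formula. Suppose $\gamma = qP + \epsilon$, we have: 

$$\Pr[Y_i \ge (qP + \epsilon) \cdot 2^i] \le \left(\frac{eqP}{qP + \epsilon}\right)^{qP+\epsilon} = \left(\frac{e}{1 + \epsilon/(qP)}\right)^{qP+\epsilon}$$

Notice that $X = \sum_{i=1}^{\lambda} Y_i$. Therefore, due to union bound, 

$$\Pr\left[X > \beta\lambda P + 2 \cdot 2^\lambda \cdot \log\frac{1}{\delta}\right] \le \Pr\left[X > \beta\lambda P + \log\frac{1}{\delta} \cdot \sum_{i=1}^{\lambda} 2^i\right] \le \delta$$

□ 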
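
The tail estimate in the proof rests on a union bound over subsets followed by a Stirling-type bound on the binomial coefficient. The sketch below compares three quantities numerically for a count of $P$ independent indicators with success probability $q$: the exact tail, the union bound $\binom{P}{\gamma} q^\gamma$, and $(eqP/\gamma)^\gamma$. The function name `binomial_tail_bounds` and the sample parameters are illustrative assumptions, and the chain of inequalities shown is the standard one, assumed here to match the form used in the proof.

```python
import math

def binomial_tail_bounds(P, q, gamma):
    """Return (exact, union, stirling) for Pr[Bin(P, q) >= gamma], gamma >= 1.

    exact    : the exact binomial upper tail
    union    : C(P, gamma) * q**gamma, a union bound over gamma-subsets
    stirling : (e*q*P/gamma)**gamma, using C(P, gamma) <= (e*P/gamma)**gamma
    """
    exact = sum(math.comb(P, k) * q**k * (1 - q)**(P - k)
                for k in range(gamma, P + 1))
    union = math.comb(P, gamma) * q**gamma
    stirling = (math.e * q * P / gamma) ** gamma
    return exact, union, stirling
```

Calling `binomial_tail_bounds(100, 0.05, g)` for increasing `g` shows each quantity dominated by the next, with the bounds becoming meaningful (below 1) once `g` is sufficiently larger than $eqP$.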



Lemma 10 (Total amount of work over all partitions). At any given point of time, let $X_j$ denote 
the maximum job size for the $j$-th partition. From Lemma 7, we know that for $1 \le i \le \lceil \log S \rceil$, 
$1 \le j \le P$, $\Pr[X_j = 2^i] \le \frac{4\tau}{P \cdot 2^i}$. Let $X := \sum_{j \in [P]} X_j$. Then 

$$\Pr\left[X > 8\tau \log S + 3 \cdot S \cdot \log\frac{1}{\delta}\right] < \delta$$

(Due to Fact 6, the total amount of work across all partitions is bounded by $2X$.) 

Proof. Let $\beta = \frac{8\tau}{P}$ and $\lambda = \log S$. The proof is similar to that of Lemma 9, except that the $X_j$'s are 
no longer independent. 

We use negative associativity [3] to deal with this problem. For technical reasons which will become 
clear later, we will consider the sum over all but one partition. Specifically, let $\hat{X} := \sum_{j \in [P-1]} X_j$. 
Hence, $X = \hat{X} + X_P$. Since the maximum job size for the $P$-th partition can be at most $S$, it 
suffices to prove that 

$$\Pr\left[\hat{X} > 8\tau \log S + 2 \cdot S \cdot \log\frac{1}{\delta}\right] < \delta$$

Similar to the proof of Lemma 9, for $j \in [P]$ and $i \in [\lambda]$, let $Y_{j,i}$ denote the following random 
variable, but with a minor change: instead of letting $Y_{j,i} = 1$ when $X_j = 2^i$, we now let 
$Y_{j,i} = 1$ when $X_j \ge 2^i$. 

$$Y_{j,i} = \begin{cases} 1 & \text{if } X_j \ge 2^i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

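Let $Y_i := 2^i \cdot \sum_{j \in [P-1]} Y_{j,i}$. Observe that in this case, $\hat{X} \le \sum_{i=1}^{\lambda} Y_i$ and 
$\Pr[Y_{j,i} = 1] = \Pr[X_j \ge 2^i] \le 2\Pr[X_j = 2^i] \le \frac{\beta}{2^i}$. 

If we can prove that $Y_{1,i}, Y_{2,i}, \ldots, Y_{P-1,i}$ are negatively associated (which we will show later), 
then we will have the following: 

$$\begin{aligned}
\Pr[Y_i > \gamma \cdot 2^i] &\le \sum_{\mathcal{S} \subseteq [P-1],\, |\mathcal{S}| = \gamma} \Pr[\forall j \in \mathcal{S}: Y_{j,i} = 1] \\
&= \sum_{\mathcal{S} \subseteq [P-1],\, |\mathcal{S}| = \gamma} \Pr[\forall j \in \mathcal{S}: Y_{j,i} \ge 1] \\
&\le \sum_{\mathcal{S} \subseteq [P-1],\, |\mathcal{S}| = \gamma} \prod_{j \in \mathcal{S}} \Pr[Y_{j,i} \ge 1] \quad \text{(due to negative associativity, Proposition 4 in [3])} \\
&= \sum_{\mathcal{S} \subseteq [P-1],\, |\mathcal{S}| = \gamma} \prod_{j \in \mathcal{S}} \Pr[Y_{j,i} = 1] \\
&\le \binom{P-1}{\gamma} q^\gamma \le \left(\frac{eqP}{\gamma}\right)^\gamma
\end{aligned}$$

where $q := \frac{\beta}{2^i}$. 

From here on, the remaining proof is similar to that of Lemma 9. 

So far, we have shown how to prove this lemma assuming that $Y_{1,i}, Y_{2,i}, \ldots, Y_{P-1,i}$ are negatively 
associated. The following lemma shows that this is indeed the case. □ 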

Lemma 11. For any $i \in [\log S]$, the random variables $Y_{1,i}, Y_{2,i}, \ldots, Y_{P-1,i}$ as defined in Equation (1) 
are negatively associated. 

Proof. Suppose we run for $M$ time steps, where $M$ is large enough that the system reaches steady 
state, and we examine the random variables $Y_{j,i}$ at the end of $M$ time steps. 

To show that after $M$ time steps, $Y_{1,i}, Y_{2,i}, \ldots, Y_{P-1,i}$ are negatively associated, we first argue 
that after $M - \tau$ steps, the counters of each partition $C_1, \ldots, C_{P-1} \pmod{S}$ are independent. To 
see this, think of the vector $(c_1, c_2, \ldots, c_{P-1})$ as the configuration of counters of all but one partition 
at time $t$. From time $t$ to $t+1$, the configuration vector transitions into any neighboring vector 
$(c_1, c_2, \ldots, c_{j-1}, c_j + 1 \pmod{S}, c_{j+1}, \ldots, c_{P-1})$ with probability $\frac{1}{P}$, and self-loops with probability $\frac{1}{P}$. 
It is not hard to see that the configuration vector forms an ergodic Markov Chain. In addition, in 
the stationary distribution, all configurations are "equivalent", so all configurations have the same 
probability. It is not hard to see now that the counters $(C_1, \ldots, C_{P-1})$ are independent random 
variables at time step $M - \tau$. Note that the reason why we left one partition out is that the 
counters $(C_1, \ldots, C_P)$ form a periodic Markov Chain (i.e., the sum of all counters 
should be equal to the number of time steps mod $S$); therefore, the chain does not converge to a 
stationary distribution. 
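
The claim that the counters of all but one partition become uniform, and hence independent, can be checked on a toy instance by evolving the exact distribution of the configuration vector. This is an illustrative sketch only: the function name `counter_distribution` and the small parameters are assumptions, and the code enumerates all $S^{P-1}$ configurations, so it is practical only for tiny $P$ and $S$.

```python
from itertools import product

def counter_distribution(P, S, steps):
    """Distribution of the first P-1 counters (mod S) after `steps` steps.

    Each step picks one of the P partitions uniformly at random and
    increments its counter; picking partition P leaves the vector unchanged.
    """
    states = list(product(range(S), repeat=P - 1))
    dist = {s: 0.0 for s in states}
    dist[states[0]] = 1.0  # start with all counters at zero
    for _ in range(steps):
        nxt = {s: 0.0 for s in states}
        for s, pr in dist.items():
            if pr == 0.0:
                continue
            nxt[s] += pr / P  # self-loop: the left-out partition was chosen
            for j in range(P - 1):
                moved = s[:j] + ((s[j] + 1) % S,) + s[j + 1:]
                nxt[moved] += pr / P
        dist = nxt
    return dist
```

With `counter_distribution(3, 4, 200)`, every one of the 16 configurations ends up with probability very close to 1/16. A uniform distribution over the product space means the individual counters are uniform and mutually independent.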

We now argue why after $M$ time steps, $Y_{1,i}, Y_{2,i}, \ldots, Y_{P-1,i}$ are negatively associated. 

For $j \in [P-1]$, $t \in [M - \tau + 1, M]$, let $B_{j,t}$ be the indicator random variable defined below: 

$$B_{j,t} = \begin{cases} 1 & \text{if partition } j \text{ is chosen in the } t\text{-th time step} \\ 0 & \text{otherwise} \end{cases}$$

Due to Proposition 11 of [3], the vector $\{B_{j,t}\}_{j \in [P-1],\, t \in [M-\tau+1, M]}$ is negatively associated. Earlier, 
we argued that the partition counters mod $S$ for all but one partition are independent. Using a 
similar argument, we can prove that the partition counters $(C_1, C_2, \ldots, C_{P-1}) \pmod{2^i}$ are 
independent. Since the variables $\{B_{j,t}\}_{j \in [P-1],\, t \in [M-\tau+1, M]}$ are independent of the partition counters 
$(C_1, C_2, \ldots, C_{P-1}) \pmod{2^i}$ at time $M - \tau$, we can conclude that the union of the partition 
counters $(C_1, C_2, \ldots, C_{P-1}) \pmod{2^i}$ and the vector $\{B_{j,t}\}_{j \in [P-1],\, t \in [M-\tau+1, M]}$ is negatively associated. 
Without loss of generality, we assume that every job will remain in the queue for exactly $\tau$ amount 
of time, even if the job may be completed before that. 

At the end of $M$ time steps, the random variable $Y_{j,i}$ is a non-decreasing function of the partition 
counter $C_j \pmod{2^i}$ at time $M - \tau$, and $\{B_{j,t}\}_{t \in [M-\tau+1, M]}$. Due to Proposition 7 of [3], the vector 
$(Y_{j,i})_{j \in [P-1]}$ is negatively associated. Furthermore, this holds for all $i \in [\lambda]$. □ 

Amount of work amortized to each time step. We will pick $\tau = P = \sqrt{N}$. As a result, if 
the amortizer schedules $16 \log S$ work to be done per time step, then with probability $1 - \frac{1}{\text{poly}(N)}$, 
we guarantee that each job remains in the queue for $\tau$ time or less. 

The remainder of this section will bound the additional client cache required for the concurrent 
construction, including the memory necessary for storing cached reads and pending writes. 



D.3 Bounding the Size of Cached Reads 

For the practical construction, if a partition currently has a shuffling job up to level $\ell$ in the queue, 
and a ReadPartition operation occurs, then the blocks read from levels $1, \ldots, \ell$ of that partition 
need to be cached, since the same block cannot be read twice. Note that reads for levels $\ell + 1$ and 
higher need not be cached. We now bound the size of the client cache needed to store these cached 
reads. 

Note also that in the theoretic construction the buffer for cached reads is optional, since we 
use a variant of the Goodrich-Mitzenmacher [6] O-RAM as the partition O-RAM, where we do not 
use any dummy blocks, and the O-RAM can be read an unlimited number of times. In particular, 
every time we read a level, it always appears to the server that two random locations are being 
accessed. 

Without loss of generality, we will assume that every cached read will remain in the client's 
cache for $\tau$ amount of time. In reality, since all shuffling jobs are guaranteed to be done within 
$\tau$ time, each cached read will remain in the cache for $\tau$ time or less. Therefore, this allows us to 
obtain an upper bound on the size of cached reads in the client's cache. 

We first study the distribution of the size of cached reads for a single partition. At some time 
$t + \tau$, the cached reads are accumulated from time steps $t$ to $t + \tau$. For each cached read, its size 
is the logarithm of the size of the largest job at the time the cached read is created. For the 
sake of an upper bound, we will assume that within a window $[t, t + \tau]$, the size of every cached read 
is the logarithm of the size of the largest job between time $t$ and $t + \tau$. 

Lemma 12. In steady state, let $p_i$ denote the probability that the largest job within a time window 
$[t, t + \tau]$ for a given partition is of size $2^i$, for any time $t$. Let $\tau = P$. Then, 

$$p_i \le \frac{8}{2^i}$$

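Proof. Consider the Markov Chain for the counter $c$ for a given partition. 

Let $q_c$ denote the probability that the counter is $c \bmod S$ in the steady state. It is not hard to 
see that $q_c = 1/S$ for any $1 \le c \le S$. 

Let $R_t$ be a random variable representing the counter value at time $t$. 

$$\begin{aligned}
&\Pr[\text{largest job between } t \text{ and } t + \tau \text{ has size } 2^i] \\
&\quad \le \sum_{0 \le a \le 2\tau} \Pr[\exists k: R_{t-\tau} = k \cdot 2^i - a] \cdot \Pr[\text{this partition is selected} \ge a \text{ times in a time window of } 2\tau] \\
&\quad \le \sum_{0 \le a \le 2\tau} \frac{1}{2^i} \cdot \Pr[\text{this partition is selected} \ge a \text{ times in a time window of } 2\tau]
\end{aligned}$$

The first factor is $\frac{1}{2^i}$ because the counter is uniformly distributed in steady state, and the 
second factor decays exponentially once $a$ exceeds the expected number of selections $\frac{2\tau}{P}$. 

When we select $\tau = P$, it is not hard to see that 

$$\Pr[\text{largest job between } t \text{ and } t + \tau \text{ has size } 2^i] \le \frac{8}{2^i}$$

□ 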



Lemma 13. At any point of time $t$ in steady state, let $X_j$ denote the logarithm of the largest job size 
for partition $j \in [P]$ in a time window $[t, t + \tau]$. Let $X = \sum_{j \in [P]} X_j$. Then 

$$\Pr\left[X > E[X] + (\sqrt{P} + 1) \log S \sqrt{\log\frac{1}{\delta}}\right] < \delta$$



Proof. For technical reasons, we first bound the sum of the $X_j$'s for all but one partition, i.e., 
$X_1, \ldots, X_{P-1}$. Let $\hat{X} = \sum_{j=1}^{P-1} X_j$. Since $X_P \le \log S$, it suffices to prove that 

$$\Pr\left[\hat{X} > E[\hat{X}] + \sqrt{P} \log S \sqrt{\log\frac{1}{\delta}}\right] < \delta$$

We first show that $X_1, X_2, \ldots, X_{P-1}$ are negatively associated, and we then apply Hoeffding's 
inequality, which holds for negatively associated random variables. The proof of negative 
associativity is similar to the proof of negative associativity in Lemma 11, assuming that all cached reads 
will remain in the cache for exactly $\tau$ time. 
Now, according to Hoeffding's inequality, 

$$\Pr[\hat{X} > E[\hat{X}] + \epsilon P] \le \exp\left(-\frac{2\epsilon^2 P^2}{P (\log S)^2}\right) = \exp\left(-\frac{2\epsilon^2 P}{(\log S)^2}\right)$$

Let $\epsilon = \frac{\log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}}$, we have 

$$\Pr\left[\hat{X} > E[\hat{X}] + \sqrt{P} \log S \sqrt{\log\frac{1}{\delta}}\right] < \delta$$

When $\tau = P$, $\Pr[X_j = i] \le \frac{8}{2^i}$ for all $j \in [P]$. Therefore, it is not hard to see that $E[\hat{X}] \le E[X] \le 16P$. 
Therefore, 

$$\Pr\left[\hat{X} > 16P + \sqrt{P} \log S \sqrt{\log\frac{1}{\delta}}\right] < \delta$$

□ 

We now bound the size of cached reads at any point of time in steady state. The cached reads 
at time $t + \tau$ are created during time steps $t$ through $t + \tau$. Therefore, it suffices to bound the total 
size of cached reads created during time steps $t$ through $t + \tau$. 
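
For reference, the estimate $E[X] \le 16P$ used in the proof of Lemma 13 can be verified directly from the per-partition bound $\Pr[X_j = i] \le \frac{8}{2^i}$ of Lemma 12. The short derivation below is a sketch that extends the finite sum to an infinite one for convenience and uses the identity $\sum_{i \ge 1} i x^i = \frac{x}{(1-x)^2}$ at $x = \frac{1}{2}$.

```latex
\begin{align*}
E[X_j] &= \sum_{i=1}^{\lceil \log S \rceil} i \cdot \Pr[X_j = i]
        \le \sum_{i=1}^{\infty} i \cdot \frac{8}{2^i}
        = 8 \cdot \frac{1/2}{(1 - 1/2)^2}
        = 16, \\
E[X] &= \sum_{j \in [P]} E[X_j] \le 16P .
\end{align*}
```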

Let $\mathbf{X} := (X_1, X_2, \ldots, X_P)$ denote the logarithm of the maximum job size for each partition 
during time steps $t$ through $t + \tau$. If time step $t + i$, where $0 \le i \le \tau$, chooses partition $j$, then $X_j$ 
cached reads are created. 

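Lemma 14. Suppose we are given that $\mathbf{X} = \mathbf{x}$, and let $\mu = \text{mean}(\mathbf{x})$. For $0 \le i \le \tau$, let $Y_i$ denote 
the number of cached reads created in time step $t + i$. Let $Y := \sum_{i=0}^{\tau} Y_i$. Then 

$$\Pr\left[Y > \tau\mu + \sqrt{\tau} \cdot \log S \cdot \sqrt{\log\frac{1}{\delta}} \;\middle|\; \mathbf{X} = \mathbf{x}\right] < \delta$$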



Proof. At time $t + i$ (where $i \in [0, \tau]$), suppose partition $j$ is chosen. Recall that we assume the 
number of cached reads $Y_i$ generated in that time step depends on the 
largest job size of partition $j$ over the entire window $[t, t + \tau]$. As a result, given a fixed $\mathbf{X}$, each $Y_i$ 
(where $i \in [0, \tau]$) depends only on which partition is chosen to be read in time step $t + i$. Therefore, 
the $Y_i$'s are independent. 
By the Hoeffding bound, 

$$\Pr[Y > \tau\mu + \tau\epsilon \mid \mathbf{X} = \mathbf{x}] \le \exp\left(-\frac{2\epsilon^2 \tau^2}{\tau (\log S)^2}\right) = \exp\left(-\frac{2\epsilon^2 \tau}{(\log S)^2}\right)$$

Let $\epsilon = \frac{\log S}{\sqrt{\tau}} \sqrt{\log\frac{1}{\delta}}$, we have the following: 

$$\Pr\left[Y > \tau\mu + \sqrt{\tau} \cdot \log S \cdot \sqrt{\log\frac{1}{\delta}} \;\middle|\; \mathbf{X} = \mathbf{x}\right] < \delta$$

□ 
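
The choice of $\epsilon$ in this proof can be checked by direct substitution into the Hoeffding exponent $\exp\left(-2\epsilon^2\tau/(\log S)^2\right)$. The derivation below is a sketch that assumes $0 < \delta < 1$ and reads $\log\frac{1}{\delta}$ as a natural logarithm; with a base-2 logarithm the final inequality still holds.

```latex
\begin{align*}
\epsilon = \frac{\log S}{\sqrt{\tau}} \sqrt{\log \frac{1}{\delta}}
\;\Longrightarrow\;
\tau\epsilon &= \sqrt{\tau} \cdot \log S \cdot \sqrt{\log \frac{1}{\delta}}, \\
\exp\left(-\frac{2\epsilon^2 \tau}{(\log S)^2}\right)
  &= \exp\left(-2 \log \frac{1}{\delta}\right) = \delta^2 < \delta .
\end{align*}
```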



Theorem 12 (Bounding cached reads). Consider the system in steady state. For $0 \le i \le \tau$, let 
$Y_i$ denote the number of cached reads created in time step $t + i$. Let $Y := \sum_{i=0}^{\tau} Y_i$. Assume that 
$\tau = P$. Then, 

$$\Pr\left[Y > 16P + 3\sqrt{P} \log S \sqrt{\log\frac{1}{\delta}}\right] < 2\delta$$



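Proof. According to Lemma 13, 

$$\Pr\left[\text{mean}(\mathbf{X}) > 16 + \frac{2 \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}}\right] < \delta$$

Let $G$ denote the event that $\text{mean}(\mathbf{X}) \le 16 + \frac{2 \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}}$. Let $\bar{G}$ denote the event that 
$\text{mean}(\mathbf{X}) > 16 + \frac{2 \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}}$. Due to Lemma 14, 

$$\Pr\left[Y > 16\tau + \frac{2\tau \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}} + \sqrt{\tau} \cdot \log S \cdot \sqrt{\log\frac{1}{\delta}} \;\middle|\; G\right] < \delta$$

Therefore, 

$$\begin{aligned}
&\Pr\left[Y > 16\tau + \frac{2\tau \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}} + \sqrt{\tau} \cdot \log S \cdot \sqrt{\log\frac{1}{\delta}}\right] \\
&\quad \le \Pr\left[Y > 16\tau + \frac{2\tau \log S}{\sqrt{P}} \sqrt{\log\frac{1}{\delta}} + \sqrt{\tau} \cdot \log S \cdot \sqrt{\log\frac{1}{\delta}} \;\middle|\; G\right] + \Pr[\bar{G}] < 2\delta
\end{aligned}$$

Plugging in $\tau = P$, we get the above theorem. □ 
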
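
To get a feel for the magnitude of the bound in Theorem 12, it can be evaluated numerically. The helper below is a hypothetical sketch: the name `cached_reads_bound` is introduced here for illustration, and it assumes that $\log S$ is a base-2 logarithm and that $\log\frac{1}{\delta}$ is a natural logarithm, since the text does not fix the bases.

```python
import math

def cached_reads_bound(P, S, delta):
    """Evaluate 16*P + 3*sqrt(P)*log(S)*sqrt(log(1/delta)).

    Assumption: log(S) is taken base 2 and log(1/delta) is natural;
    the text does not fix the bases.
    """
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    deviation = 3 * math.sqrt(P) * math.log2(S) * math.sqrt(math.log(1.0 / delta))
    return 16 * P + deviation
```

For instance, `cached_reads_bound(1024, 1024, 1e-6)` is dominated by the $16P$ term, with the deviation term growing only as $\sqrt{P}$.
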
D.4 Bounding the Size of Pending Writes 

In the concurrent scheme, when a WritePartition operation occurs, the block to be written is added 
to the pending write queue. At each time step, $O(1)$ pending writes are created. Since 
we guarantee that all queued shuffling jobs are completed in $\tau$ time, each pending write will be 
completed in $\tau$ time. Therefore, the total number of pending writes in the system is bounded by 
$O(\tau)$. 